Low Power Architecture for JPEG2000


Rahul Jain, 2004JVL2433, M.Tech (VDTT), IIT-Delhi
Dr. P. R. Panda, Associate Professor, IIT-Delhi
S. Krishnakumar, Cypress Semiconductor, Bangalore

Agenda
 JPEG2000 and 2-D DWT
 Memory Power Optimization
 Existing 2D-DWT Scan Based Architectures
 Proposed Architectures
   Low Power Z-Scan
   Low Power Block Scan
 Optimization and Pipelining Exploration for 2D-DWT
   Proposed DFG Optimization
   Pipeline Study

JPEG2000 Computation Blocks

 Pre-processing (Image Tiling)
 Discrete Wavelet Transform
 Quantization
 Tier-1 Coding (EBCOT)
 Tier-2 Coding (File Formatting and Packing)

Discrete Wavelet Transform
 2D wavelet transform:
   1st: 1-D wavelet transform applied to all rows
   2nd: 1-D wavelet transform applied to all columns
 Each row/column can be computed independently
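[Figure: Image decomposed into subbands LL, HL (top) and LH, HH (bottom). Panels: 1-Level DWT and 2-Level DWT, where the LL subband is decomposed again into LL, HL, LH, HH.]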
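
The row-then-column order can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: `haar_1d` is a stand-in 1-D transform chosen only for brevity (the deck works with the (9,7) filter), and all names are hypothetical.

```python
def haar_1d(x):
    """Stand-in 1-D transform: low-pass half followed by high-pass half."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low + high


def dwt_2d_one_level(image, transform_1d=haar_1d):
    """One level of a separable 2-D DWT: all rows first, then all columns."""
    rows_done = [transform_1d(row) for row in image]
    # zip(*m) transposes, so each column can be fed to the same 1-D routine.
    cols_done = [transform_1d(list(col)) for col in zip(*rows_done)]
    return [list(row) for row in zip(*cols_done)]


image = [[1, 1, 5, 5],
         [1, 1, 5, 5],
         [2, 2, 6, 6],
         [2, 2, 6, 6]]
subbands = dwt_2d_one_level(image)
```

The top-left quadrant of `subbands` holds LL, the top-right HL, the bottom-left LH and the bottom-right HH. Applying the same function to the LL quadrant again gives the 2-level decomposition.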

Importance of Optimizing Memory System Energy
 Many emerging media applications like JPEG2000 are data intensive
 For ASICs and embedded systems, the memory system can contribute up to 90% of the energy
 Multiple memories exist in an SoC design

Optimization approaches
 Fixed memory access patterns
   Optimize memory architecture
 Fixed memory architecture
   Optimize memory access patterns
 Concurrently optimize Memory Architecture and Accesses
   Highest Potential
   Algorithm Level
     Reduce memory requirement
     Improve regularity of accesses
   Build optimized memory architecture
     Memory Partitioning
     Custom Circuits
   Option Explored in this Work

Memory Partitioning

 Partition the memory array into smaller banks
so that only the addressed bank is activated
 improves speed and lowers power
 bit line capacitance reduced
 number of bit cells activated reduced
 At some point the delay and power overhead
associated with the bank decoding circuit
dominates (2 to 8 banks typical)

2D-DWT Architectures
 Direct
 Line Based
 Z-Scan
 Optimal Z-Scan (Ref: Mu-Yu Chiu, Kun-Bin Lee, Chein-Wei Jen, "Optimal data transfer and buffering schemes for JPEG2000 encoder", IEEE Workshop on Signal Processing Systems (SIPS 2003), 27-29 Aug. 2003, pp. 177-182)

Direct DWT

 Straightforward Architecture
 First read the image row-wise, computing the row-wise 1-D DWT
 Then read the image column-wise, computing the column-wise 1-D DWT
 No On-Chip Buffer Required
 Reads + Writes to Off-Chip Memory = 2MN+2MN (M = image tile height, N = image tile width)

Data Dependency in (9,7)DWT

[Figure: data-dependency graph, one row of nodes per stage, listed by sample index]
X(i): 0 1 2 3 4 5 6 7 8
Y(2i+1): 1 3 5 7
Y(2i): 0 2 4 6 8
Z(2i+1): 1 3 5 7
Z(2i): 0 2 4 6 8
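
The four dependency stages correspond to four lifting steps, which can be sketched as below. The coefficient values are the commonly published ones for the (9,7) filter and are not taken from the slides; the mirrored boundary handling, the omission of the final scaling by k, and all function names are assumptions made for this sketch.

```python
# Commonly published lifting coefficients for the (9,7) filter (not from the slides).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522


def _lift(target, source, coeff, shift):
    """Add coeff * (two neighbouring source samples) to every target sample.

    shift=+1 pairs target[i] with source[i] and source[i+1] (odd samples);
    shift=-1 pairs target[i] with source[i-1] and source[i] (even samples).
    An out-of-range neighbour is replaced by the nearest valid one, which
    matches symmetric extension of the interleaved signal (an assumption).
    """
    last = len(source) - 1
    for i in range(len(target)):
        j = min(max(i + shift, 0), last)
        target[i] += coeff * (source[i] + source[j])


def dwt97_forward(x):
    """Four lifting stages on an even-length signal; scaling by k is omitted."""
    even, odd = list(x[0::2]), list(x[1::2])
    _lift(odd, even, ALPHA, +1)   # Y(2i+1) from X
    _lift(even, odd, BETA, -1)    # Y(2i)   from Y(2i+1)
    _lift(odd, even, GAMMA, +1)   # Z(2i+1) from Y(2i)
    _lift(even, odd, DELTA, -1)   # Z(2i)   from Z(2i+1)
    return even, odd              # (low-pass, high-pass)


def dwt97_inverse(low, high):
    """Undo the four stages in reverse order with negated coefficients."""
    even, odd = list(low), list(high)
    _lift(even, odd, -DELTA, -1)
    _lift(odd, even, -GAMMA, +1)
    _lift(even, odd, -BETA, -1)
    _lift(odd, even, -ALPHA, +1)
    out = [0.0] * (len(even) + len(odd))
    out[0::2], out[1::2] = even, odd
    return out


signal = [3, 7, 1, 8, 2, 9, 4, 6]
low, high = dwt97_forward(signal)
restored = dwt97_inverse(low, high)
```

Each stage only needs the two nearest outputs of the previous stage, which is why a row or column computation can be suspended and resumed by saving a small number of intermediate values.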

Line-Based DWT
 Read pixels line by line
 Keep the minimum required number of lines in memory
 Row operation gets full line data
 Column operation starts as soon as enough column data is available, which reduces the buffer
 On-Chip Buffer Required = 6*N
 Reads + Writes to Off-Chip Memory = MN+MN (M = image tile height, N = image tile width)

Z-Scan DWT
 Do a Z-Scan instead of a line-by-line scan
 Column processing can start early
 On-Chip Buffer Required = 4*M
 Reads + Writes to Off-Chip Memory = MN+MN (M = image tile height, N = image tile width)

Optimal Z-Scan
 Considers the code-block size (CW*CH) required by the encoding block in the next phase
 On-Chip Buffer Required = 4*M+4*2*CW
 Reads + Writes to Off-Chip Memory = MN+MN (M = image tile height, N = image tile width)
[Figure: scan pattern over a region 2*CW wide and 2*CH high]
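
The buffer and off-chip access formulas quoted for the four architectures can be tabulated side by side. The sketch below only restates those formulas; the function name and the example tile and code-block sizes are assumptions, and buffer sizes are in the same (unstated) units the slides use.

```python
def architecture_costs(M, N, CW):
    """Return {name: (on_chip_buffer, off_chip_reads_plus_writes)}.

    M, N: image tile height and width; CW: code-block width.
    """
    return {
        "Direct":         (0,                  2 * M * N + 2 * M * N),
        "Line-Based":     (6 * N,              M * N + M * N),
        "Z-Scan":         (4 * M,              M * N + M * N),
        "Optimal Z-Scan": (4 * M + 4 * 2 * CW, M * N + M * N),
    }


# Example sizes are illustrative only.
costs = architecture_costs(M=256, N=256, CW=64)
for name, (buffer_size, accesses) in costs.items():
    print(f"{name:15s} buffer={buffer_size:6d} accesses={accesses:8d}")
```

All scan-based schemes halve the off-chip traffic of the direct approach; they differ only in how much on-chip buffering they need.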

Low Power Z-Scan
 Compute r elements in a row before starting with the next row
   For Z-Scan, r = 1
   For Optimal Z-Scan, r = 2*CW
 On-Chip Buffer Required = 4*M+4*2*CW
 Reads + Writes to Off-Chip Memory = MN+MN (M = image tile height, N = image tile width)
[Figure: scan pattern in strips r elements wide, with a 2*CH dimension marked]

Low Power Z-Scan
 r is a sub-multiple of 2*CW (2*CW is an integer multiple of r)
   This takes the code-block size into account
 Number of wakeups of the column buffer banks depends on r
   Large value of r not desirable
 Between the resumption of a row computation and the storing back of intermediate values after calculating r row elements, the buffer can go into a low-power state
   Large value of r is desirable
 Accesses to the buffers
   Row Buffer = 2 per ‘r’ element computation
   Column Buffer = 1 per element computation
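
One reading of the scan order and of the row-buffer access count is sketched below. The generator visits r elements of a row, moves to the next row, and only after all M rows moves r columns to the right. The function names, and the assumption that N is a multiple of r, are introduced here for illustration.

```python
def low_power_z_scan(M, N, r):
    """Yield (row, col) in scan order: r elements of a row, then the next row.

    After all M rows of a strip are visited, the scan moves r columns right.
    Assumes N is a multiple of r (an assumption made for brevity).
    """
    for strip_start in range(0, N, r):
        for row in range(M):
            for col in range(strip_start, strip_start + r):
                yield row, col


def row_buffer_accesses(M, N, r):
    """2 row-buffer accesses (restore + save state) per group of r elements."""
    return 2 * M * (N // r)


order = list(low_power_z_scan(M=3, N=4, r=2))
accesses = row_buffer_accesses(M=3, N=4, r=2)
```

With r = 1 the row state is saved and restored for every element; a larger r spreads the same two accesses over more computations and leaves a longer idle window between them.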

Low Power Block Scan
 Extend the concept of ‘r’ to column processing as well
 Reduces the accesses to the column buffer from 1 per element to 2/s per element
 To maintain the throughput, introduce 2 Transpose Buffers (TB1 & TB2)
 Transpose Buffer Accesses
   Row Processor writes
   Column Processor reads
   i.e. 2 accesses per element
 TB must be much smaller than the Column Buffer
[Figure: image divided into blocks r wide and s high; B1 and B3 in the top row, B2 and B4 in the row below]

Working: Low Power Block Scan
 2D-DWT computed in blocks of r*s
 Step 1: Row Processor (RP) computes 1D-DWT on B1 and writes into TB1
 Step 2: Column Processor (CP) computes 1D-DWT on the data in TB1 (B1) and RP computes on B2 and writes into TB2
 Similarly, RP and CP alternate between TB1 and TB2
[Figure: successive steps showing RP and CP alternating between TB1 and TB2 over blocks B1 to B4]
B: Block, RP/CP: Row/Column Processor, TB: Transpose Buffer
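
The alternation between the two transpose buffers can be written out as a small schedule generator. This is a sketch of the ping-pong pattern described in the steps above; the function name and the tuple layout are hypothetical.

```python
def block_scan_schedule(num_blocks):
    """List (step, rp_action, cp_action) for RP/CP sharing two transpose buffers.

    Each action is (block, buffer) or None when that processor is idle.
    """
    buffers = ("TB1", "TB2")
    schedule = []
    for step in range(1, num_blocks + 2):
        rp = cp = None
        if step <= num_blocks:
            # RP transforms block `step` and writes it into the free buffer.
            rp = (f"B{step}", buffers[(step - 1) % 2])
        if step >= 2:
            # CP reads the block that RP wrote one step earlier.
            cp = (f"B{step - 1}", buffers[(step - 2) % 2])
        schedule.append((step, rp, cp))
    return schedule


schedule = block_scan_schedule(4)
for step, rp, cp in schedule:
    print(step, "RP:", rp, "CP:", cp)
```

In every step the two processors touch different buffers, so neither has to wait for the other and the throughput of the row processor is preserved.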

Memory Power Analysis
 Memory can be in 3 modes
   Active (read/write being done): Pa(n)
   Standby (no access being done): Pstandby(n)
   Sleep (data-retention mode, cannot be accessed): Psleep(n)
     To access from this mode, first wake up the memory
     Wakeup incurs an energy penalty Pwakeup(n)
 Let ‘T’ be the minimum number of clock cycles the memory must stay in sleep mode to get any power advantage
 To account for memory banking overhead, multiplexer power is considered
   Pmux(i,j) is the power of an i:1 multiplexer of bit width j
 Assumption: on-chip memory access latency fits into a clock period of 15 ns
 Power values refer to average power dissipation per coefficient computation for the corresponding memory component

Row and Column Buffer Power
 With a 4-stage pipelined DWT, ten 16-bit registers (160 bits) must be stored/transferred whenever a line computation is suspended or resumed
 Row Buffer
   Size = 160*M (M: height of image tile)
   ‘b’ banks, each having 160 columns and M/b rows
   One b:1 mux of 160 bits required
 Column Buffer
   Size = 160*2*CW (CW: EBCOT code-block width, usually 128)
   ‘c’ banks, each having 160 columns and 2*CW/c rows
   One c:1 mux of 160 bits required
 Column buffer power analysis is similar to the row buffer power analysis

Row Buffer Power
 Accesses to Row Buffer
   2 per ‘r’ elements, i.e. 2/r per element computation
 Only one bank active at a time, others in sleep mode
 Row Buffer Power is:
   Prow = [2*Pa(M/b) + Pmux(b,160) + (r-2)*Ps(M/b)]/r + Psleep(M/b)*(b-1)
   Ps = Psleep if (r-2) >= ‘T’, else Ps = Pstandby
 Due to sequential access to the Row Buffer, each bank is woken up once
 Total Row Buffer Power
   PTotal_Row = Prow + [Pw(M/b) * b/(M*r)]
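
The row-buffer expression can be evaluated directly once the per-mode power functions are available. The sketch below transcribes the formulas; the function signature is hypothetical, and the constant-valued lambdas are arbitrary placeholders, not measured data.

```python
def row_buffer_power(M, b, r, T, p_active, p_standby, p_sleep, p_mux, p_wakeup):
    """Average row-buffer power per coefficient, following the slide formulas.

    p_active, p_standby, p_sleep, p_wakeup: callables taking the bank size.
    p_mux: callable taking (number_of_inputs, bit_width).
    """
    bank = M / b
    # Between its two accesses the active bank idles for (r - 2) cycles;
    # it may sleep only if that window is at least T cycles long.
    p_idle = p_sleep if (r - 2) >= T else p_standby
    p_row = (2 * p_active(bank) + p_mux(b, 160) + (r - 2) * p_idle(bank)) / r
    p_row += p_sleep(bank) * (b - 1)          # the other b-1 banks stay asleep
    return p_row + p_wakeup(bank) * b / (M * r)


# Placeholder power models (arbitrary units, for illustration only).
toy = dict(p_active=lambda n: 10.0, p_standby=lambda n: 2.0,
           p_sleep=lambda n: 1.0, p_mux=lambda i, j: 4.0,
           p_wakeup=lambda n: 8.0)
total_sleeping = row_buffer_power(M=128, b=8, r=16, T=10, **toy)
total_standby = row_buffer_power(M=128, b=8, r=16, T=20, **toy)
```

Raising T above r - 2 forces the active bank to wait in standby instead of sleep, which is why `total_standby` comes out higher than `total_sleeping` for the same placeholder models.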

Transpose Buffer Power
 2 buffers required, each of size r*s*16 bits and partitioned into ‘d’ banks
 Accesses and number of wakeups
   RP: sequential order, hence d wakeups for r*s elements
   CP: sequential order, but in jumps of r elements
     CP reads s elements from d banks
     Each bank has s/d elements
     If s-s/d >= ‘T’, put banks in sleep mode; number of wakeups per element = d/s
 Power
   If (s-s/d >= T): PBuffer = 2*Pa(r*s/d) + Mux Power + 2*(d-1)*Psleep(r*s/d)
   Else: PBuffer = 2*Pa(r*s/d) + Mux Power + (d-1)*Psleep(r*s/d) + (d-1)*Pstandby(r*s/d)
   Mux Power = Pmux(d,16) + Pmux(2,16)
   Wakeup Power = Pw(r*s/d) * PBuffer_Wake
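
The two-case transpose-buffer expression can be sketched the same way. The wakeup term is left out because its exact form is not recoverable from the slide; the function signature and the constant-valued placeholder models are assumptions.

```python
def transpose_buffer_power(r, s, d, T, p_active, p_standby, p_sleep, p_mux):
    """Transpose-buffer power (both buffers), without the wakeup term.

    p_active, p_standby, p_sleep: callables taking the bank size r*s/d.
    p_mux: callable taking (number_of_inputs, bit_width).
    """
    bank = r * s / d
    mux_power = p_mux(d, 16) + p_mux(2, 16)
    if s - s / d >= T:
        # Idle window is long enough: idle banks of both buffers sleep.
        idle = 2 * (d - 1) * p_sleep(bank)
    else:
        # Otherwise one buffer's idle banks sleep, the other's wait in standby.
        idle = (d - 1) * p_sleep(bank) + (d - 1) * p_standby(bank)
    return 2 * p_active(bank) + mux_power + idle


# Placeholder power models (arbitrary units, for illustration only).
toy = dict(p_active=lambda n: 10.0, p_standby=lambda n: 2.0,
           p_sleep=lambda n: 1.0, p_mux=lambda i, j: 4.0)
p_long_idle = transpose_buffer_power(r=16, s=16, d=4, T=10, **toy)
p_short_idle = transpose_buffer_power(r=16, s=16, d=4, T=20, **toy)
```

With r = s = 16 and d = 4 the idle window s - s/d is 12 cycles, so the first call takes the sleep branch and the second the standby branch.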

Memory Architecture
 Row and Column Buffers
   Used as circular FIFOs
   Replace the general row decoder with a custom circuit for addressing
 Similar observation for Transpose Buffer
 Custom Row Decoder
   Counter and a Decoder
   [Figure: log(n)-bit counter driving log(n) address lines into a row decoder with n outputs]
   Circular Shift Register (CSR)
     Flip-flop corresponding to the accessed row stores ‘1’
     A lot of power is dissipated at the FF clock pins
   Proposed Power Efficient CSR
     During shifting only 2 FFs need to be enabled
     Use clock gating for the others
Comparison of 3 Row Decoders
[Figure: two charts, "Power Comparison" (Power in uW) and "Area Comparison" (Area in um^2), each plotted against the number of bits (8, 16, 32, 64, 128, 256, 512) for CSR, Clock-Gated CSR, and Cntr+RD]

 Proposed row decoder is up to 90% and 84% more power efficient than the CSR and the Cntr+Decoder, respectively
 Area penalty of about 15%

Memory Energy Modeling
 Active Energy modeled using eCACTI
 eCACTI models leakage current also
 Models Cache Power
 Modified to get SRAM power
 Standby Energy
 IStandby = 1.83 nA at Vdd = 1V [Qin05]
 Sleep Mode Energy
 ISleep = 0.55 nA at Vdd = 0.49V [Qin05]
 Wakeup Energy
 Ewakeup = 0.57 fJ * no of bits in SRAM
[Qin05] H. Qin, et al., "Standby supply voltage minimization for deep sub-micron SRAM", IEEE Microelectronics Journal, Aug 2005, vol. 36, pp. 789-800
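
The quoted figures translate into power and energy values with simple arithmetic. The sketch below assumes power = current * supply voltage and uses a hypothetical row-buffer bank (M = 128, b = 8) as the example; the slide does not say what the currents are normalized to, so the resulting powers carry the same unstated normalization.

```python
# Values as quoted on the slide from [Qin05].
I_STANDBY_A, VDD_STANDBY_V = 1.83e-9, 1.0
I_SLEEP_A, VDD_SLEEP_V = 0.55e-9, 0.49
E_WAKEUP_PER_BIT_J = 0.57e-15


def standby_power_w():
    """Standby power as current times supply voltage."""
    return I_STANDBY_A * VDD_STANDBY_V


def sleep_power_w():
    """Sleep (data-retention) power at the reduced supply voltage."""
    return I_SLEEP_A * VDD_SLEEP_V


def wakeup_energy_j(bits):
    """Wakeup energy scales with the number of bits in the SRAM bank."""
    return E_WAKEUP_PER_BIT_J * bits


# Hypothetical bank: 160 columns and M/b rows with M = 128, b = 8.
bank_bits = 160 * (128 // 8)
bank_wakeup_j = wakeup_energy_j(bank_bits)
print(standby_power_w(), sleep_power_w(), bank_wakeup_j)
```

Because the wakeup energy is paid per bit, smaller banks are cheaper to wake, which is one reason partitioning and sleep modes work well together.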

Architecture Comparison

 8 Banks for row and column buffer in all the 3
architectures
 Low Power Block Scan
 r =16 and s = 16

Optimization and Pipeline
Exploration

4 Stage Pipelining
 Critical Path is Ta + Tm
 Initiation Interval = 1, Resource Requirement
   4 Multipliers
   8 Adders
   11 Registers
     6 Pipelining Registers
     4 for e1-e4
     1 for Z4
 Initiation Interval = 2, Resource Requirement
   2 Multipliers
   4 Adders
   9 Registers

Reducing Scaling Step Multipliers
 After each 1D-DWT, multiply the low-pass coefficients by k and the high-pass coefficients by 1/k
 Delay the de-interleaving of coefficients to save 75% of the multiplications
 With a throughput of 2, only 1 multiplication per cycle is needed, hence 1 multiplier is required
 Other architectures require 4 multipliers, 2 each for the row and column processors
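
One plausible reading of the 75% figure: scaling after each pass costs two multiplications per coefficient, whereas the combined factors are k^2 for LL, 1/k^2 for HH, and 1 for HL and LH, so only two of every four coefficients need a single multiplication. The sketch below checks that both approaches give the same values; the function names and the value of k are illustrative assumptions.

```python
def scale_per_pass(coeff, k, row_band, col_band):
    """Scale after each 1-D pass: low-pass by k, high-pass by 1/k.

    Returns (scaled_value, multiplications_used).
    """
    value, mults = coeff, 0
    for band in (row_band, col_band):
        value *= k if band == "L" else 1 / k
        mults += 1
    return value, mults


def scale_deferred(coeff, k, row_band, col_band):
    """Apply the combined factor once per subband; mixed subbands need none."""
    if row_band == col_band == "L":
        return coeff * (k * k), 1      # LL: k^2
    if row_band == col_band == "H":
        return coeff / (k * k), 1      # HH: 1/k^2
    return coeff, 0                    # HL, LH: k * (1/k) = 1


k = 1.23                               # illustrative value only
subbands = [("L", "L"), ("H", "L"), ("L", "H"), ("H", "H")]
per_pass = [scale_per_pass(5.0, k, rb, cb) for rb, cb in subbands]
deferred = [scale_deferred(5.0, k, rb, cb) for rb, cb in subbands]
mults_per_pass = sum(m for _, m in per_pass)
mults_deferred = sum(m for _, m in deferred)
saving = 1 - mults_deferred / mults_per_pass
```

For one coefficient from each of the four subbands this gives 8 multiplications against 2, i.e. the 75% saving quoted above.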

Pipeline Study
 Optimized DFG pipelined from 2 stages to 8 stages
 Study done to find the most power-efficient strategy
 Impact of pipelining on clock network power also accounted for

Clock Tree Power Model

 H-Tree network assumed
 Buffer energy also considered
 Number of levels increases with the number of registers
   More interconnect
   More buffers

http://www.acsel-lab.com/Projects/detclocking/power_comparison.htm

Energy Components of Different Pipeline
Schemes

Conclusion
 “Low-Power Z-Scan” and “Low Power Block Scan” derived using different memory subsystem optimization techniques
 Optimizing the memory subsystem can result in up to 90% power savings
 1D-DWT DFG optimization proposed
 4-stage pipelining on the optimized DFG is the most energy-efficient pipelined architecture

Thank You
 “A Power-Efficient Architecture for the 2-D Discrete Wavelet Transform”, Submitted to IEEE VLSI Design and Test Symposium, 2006
 “Memory Architecture Exploration for Power-Efficient 2D-Discrete Wavelet Transform”, Submitted to CODES+ISSS 2006
 “Optimization and Pipeline Exploration of 2D-Discrete Wavelet Transform”, Submitted to CASES 2006
