IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 759 
32 Bit×32 Bit Multiprecision Razor-Based Dynamic 
Voltage Scaling Multiplier With Operands Scheduler 
Xiaoxiao Zhang, Student Member, IEEE, Farid Boussaid, Senior Member, IEEE, 
and Amine Bermak, Fellow, IEEE 
Abstract—In this paper, we present a multiprecision (MP) reconfigurable multiplier that incorporates variable precision, parallel processing (PP), razor-based dynamic voltage scaling (DVS), and dedicated MP operand scheduling to provide optimum performance under a variety of operating conditions. All of the building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to perform higher-precision multiplications. Given the user's requirements (e.g., throughput), a dynamic voltage/frequency scaling management unit configures the multiplier to operate at the proper precision and frequency. Adapting to the run-time workload of the targeted application, razor flip-flops together with a dithering voltage unit then configure the multiplier to achieve the lowest power consumption. The single-switch dithering voltage unit and razor flip-flops reduce the voltage safety margins and the overhead typically associated with DVS to the lowest level. The large silicon area and power overhead typically associated with reconfigurability are removed. Finally, the proposed MP multiplier can further benefit from an operand scheduler that rearranges the input data so as to determine the optimum voltage and frequency operating conditions for minimum power consumption. This low-power MP multiplier is fabricated in AMIS 0.35-μm technology. Experimental results show that the proposed MP design features a 28.2% and 15.8% reduction in circuit area and power consumption, respectively, compared with a conventional fixed-width multiplier. When combining this MP design with error-tolerant razor-based DVS, PP, and the proposed operand scheduler, a 77.7%–86.3% total power reduction is achieved with a total silicon area overhead as low as 11.1%. This paper demonstrates that an MP architecture allows more aggressive frequency/supply voltage scaling for improved power efficiency.
Index Terms—Computer arithmetic, dynamic voltage scaling, 
low power design, multi-precision multiplier. 
Manuscript received June 8, 2012; revised February 11, 2013; accepted February 20, 2013. Date of publication April 18, 2013; date of current version March 18, 2014. This work was supported in part by a grant from the HK Research Grant Council under Grant 610509 and by the Australian Research Council's Discovery Projects Funding Scheme under Grant DP130104374.
X. Zhang and A. Bermak are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: zhangxx@ust.hk; eebermak@ust.hk).
F. Boussaid is with the School of Electrical, Electronic, and Computer Engineering, The University of Western Australia, Perth 6017, Australia (e-mail: farid.boussaid@uwa.edu.au).
Digital Object Identifier 10.1109/TVLSI.2013.2252032

I. INTRODUCTION
Consumer demand for increasingly portable yet high-performance multimedia and communication products imposes stringent constraints on the power consumption of individual internal components [1]–[4]. Of these, multipliers perform one of the most frequently encountered arithmetic operations in digital signal processors (DSPs) [4]. For embedded applications, it has become essential to design more power-aware multipliers [4]–[13]. Given their fairly complex
structure and interconnections, multipliers can exhibit a large 
number of unbalanced paths, resulting in substantial glitch 
generation and propagation [8], [11]. This spurious switching 
activity can be mitigated by balancing internal paths through a 
combination of architectural and transistor-level optimization 
techniques [8], [11]. In addition to equalizing internal path 
delays, dynamic power reduction can also be achieved by monitoring the effective dynamic range of the input operands so as to disable unused sections of the multiplier [6], [12] and/or truncate the output product at the cost of reduced precision [13]. This is possible because, in most sensor applications, the actual inputs do not always occupy their entire word-length. For example, in artificial neural network applications, the weight precision used during the learning phase is approximately twice that of the retrieval phase [14]. Moreover, operations at lower precisions are the most frequently required. In contrast, most of today's full-custom DSPs and application-specific integrated circuits (ASICs) are designed for a fixed maximum word-length so as to accommodate the worst case scenario. Therefore, an 8-bit multiplication computed on a 32-bit Booth multiplier would result in unnecessary switching activity and power loss.
Several works have investigated this word-length optimization. References [1] and [2] proposed an ensemble of multipliers of different precisions, with each optimized to cater for a particular scenario. Each pair of incoming operands is routed to the smallest multiplier that can compute the result, to take advantage of the lower energy consumption of the smaller circuit. This ensemble of point systems is reported to consume the least power, but this comes at the cost of increased chip area given the ensemble structure used. To address this issue, [3] and [5] proposed to share and reuse some functional modules within the ensemble. In [3], an 8-bit multiplier is reused for the 16-bit multiplication, adding scalability without a large area penalty. Reference [5] extended this method by implementing pipelining to further improve the multiplier's performance. A more flexible approach is proposed in [15], with several multiplier elements grouped together to provide higher precisions and reconfigurability. Reference [7] analyzed the overhead associated with such reconfigurable multipliers. This analysis showed that around 10%–20% of extra chip area is needed for 8–16 bit multipliers.
Combining multiprecision (MP) with dynamic voltage scaling (DVS) can provide a dramatic reduction in power consumption by adjusting the supply voltage according to the circuit's
run-time workload rather than fixing it to cater for the worst 
case scenario [4]. When adjusting the voltage, the actual 
performance of the multiplier running under scaled voltage 
has to be characterized to guarantee a fail-safe operation. 
Conventional DVS techniques consist mainly of lookup table 
(LUT) and on-chip critical path replica approaches [17]–[19]. 
The LUT approach tunes the supply voltage according to a 
predefined voltage-frequency relationship stored in a LUT, 
which is formed considering worst case conditions (process 
variations, power supply droops, temperature hot-spots, coupling noise, and many more). Therefore, large margins are necessarily added, which in turn significantly decrease the
effectiveness of the DVS technique. The critical path replica 
approach typically involves an on-chip critical path replica to 
approximate the actual critical path. Therefore, voltage could 
be scaled to the extent that the replica fails to meet the timing. 
However, safety margins are still needed to compensate for the 
intradie delay mismatch and address fast-changing transient 
effects [24]. In addition, the critical path may change as a result of varying supply voltage, process, or temperature variations. If this occurs, computations will completely fail regardless of the safety margins. The aforementioned
limitations of conventional DVS techniques motivated recent research efforts into error-tolerant DVS approaches [24]–[27], which can operate the circuit at run time even at a voltage level at which timing errors occur. A recovery mechanism is then applied to detect error occurrences and restore the correct data.
Because they completely remove worst case safety margins, error-tolerant DVS techniques can reduce power consumption more aggressively. In this paper, we propose a low-power reconfigurable multiplier architecture that combines MP with an error-tolerant DVS approach based on razor flip-flops [25]. The main contributions of this paper can be summarized as follows.
1) A novel MP multiplier architecture featuring, 
respectively, 28.2% and 15.8% reduction in silicon area 
and power consumption compared with its conventional 
32 × 32 bit fixed-width multiplier counterpart. All reported multipliers trade silicon area/power consumption for MP [7]. In this paper, silicon area is optimized by applying an operation reduction technique that replaces a multiplier with adders/subtractors.
2) A silicon implementation of this MP multiplier integrating an error-tolerant razor-based DVS approach. The fabricated chip demonstrates run-time
adaptation to the actual workload by operating at the 
minimum supply voltage level and minimum clock 
frequency while meeting throughput requirements. Prior 
works combining MP with DVS have only considered a limited number of offline simulated precision-voltage pairs, with unnecessarily large safety margins added to cater for critical paths [9], [10].
3) A novel dedicated operand scheduler that rearranges operations on input operands so as to reduce the number of supply voltage transitions and, in turn, minimize the overall power consumption of the multiplier.
Fig. 1. Overall multiplier system architecture. 
Unlike reported scheduling works, the function of the proposed scheduler is not task scheduling but rather input operand scheduling for the proposed MP multiplier.
The rest of this paper is organized as follows. Section II 
presents the operation and architecture of the proposed MP 
multiplier. Section III presents the approach used to reduce the 
overhead associated with MP and reconfigurability. Section IV
presents the operating principle and implementation of the 
DVS management unit. Section V presents the razor flip-flops, 
which are at the heart of the DVS flow. Section VI presents experimental results. Section VII presents the operands scheduler unit. Finally, a conclusion is given in Section VIII.
II. SYSTEM OVERVIEW AND OPERATION 
The proposed MP multiplier system (Fig. 1) comprises the following five modules:
1) the MP multiplier; 
2) the input operands scheduler (IOS), whose function is to reorder the input data stream into a buffer so as to reduce the required power supply voltage transitions;
3) the frequency scaling unit (FSU), implemented using a voltage controlled oscillator (VCO); its function is to generate the required operating frequency of the multiplier;
4) the voltage scaling unit (VSU), implemented using a voltage dithering technique to limit silicon area overhead; its function is to dynamically generate the supply voltage so as to minimize power consumption;
5) the dynamic voltage/frequency management unit 
(VFMU) that receives the user requirements (e.g., 
throughput). 
The VFMU sends control signals to the VSU and FSU 
to generate the required power supply voltage and clock 
frequency for the MP multiplier. 
The MP multiplier is responsible for all computations. 
It is equipped with razor flip-flops that can report timing errors associated with insufficiently high supply voltage levels.
Fig. 2. Possible configuration modes of proposed MP multiplier.
The operation principle is as follows. Initially, the multiplier 
operates at a standard supply voltage of 3.3 V. If the razor flip-flops 
of the multiplier do not report any errors, this means that 
the supply voltage can be reduced. This is achieved through the VFMU, which sends control signals to the VSU so as to lower the supply voltage level. When the feedback provided
by the razor flip-flops indicates timing errors, the scaling of 
the power supply is stopped. 
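For illustration, this control policy can be summarized by the minimal behavioral sketch below (Python). The step size, floor voltage, and the helper names read_razor_errors() and set_supply() are illustrative assumptions and not part of the fabricated VFMU.

# Simplified behavioral model of the razor-based voltage tuning loop
# described above. Step size, floor voltage, and helper names are
# illustrative assumptions, not the actual VFMU implementation.
V_NOMINAL = 3.3      # start-up supply voltage (V)
V_FLOOR = 0.8        # lowest supply verified on the chip (V)
V_STEP = 0.05        # assumed VFMU scaling granularity (V)

def tune_supply(read_razor_errors, set_supply):
    """Lower the supply until the razor flip-flops report a timing error,
    then stop scaling and restore the last error-free level."""
    v = V_NOMINAL
    set_supply(v)
    while v - V_STEP >= V_FLOOR:
        v -= V_STEP
        set_supply(v)
        if read_razor_errors():   # error feedback from the razor flip-flops
            v += V_STEP           # back off to the last error-free level
            set_supply(v)
            break
    return v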
The proposed multiplier (Fig. 2) not only combines MP 
and DVS but also parallel processing (PP). Our multiplier 
comprises nine 8 × 8 bit reconfigurable multipliers. These building blocks can either work as nine independent multipliers or work in parallel to perform one, two, or three 16 × 16 bit multiplications or a single 32 × 32 bit operation. PP can be
used to increase the throughput or reduce the supply voltage 
level for low power operation. 
Fig. 3 shows the benefits of the different approaches being 
considered. Power consumption is a linear function of the workload, which is normally represented by the input operand precision. Curve 1 corresponds to the case of a fixed-precision
(FP) multiplier using a fixed power supply. Region 1 shows 
the power optimization space for MP techniques, which use 
different-precision multiplications to reduce power. If one 
combines MP with DVS, power is further reduced with 
curves (1)–(3) becoming curves (4)–(6), respectively. Regions 
1 and 2 show the power optimization space for the combined 
approach. With PP, the operating frequency can be decreased together with the supply voltage, as shown in curves (7) and (8). Finally, region 3 shows the optimization space for the proposed approach, which combines MP and DVS with PP.
III. MP AND RECONFIGURABILITY OVERHEAD 
Fig. 4 shows the structure of the input interface unit, 
which is a submodule of the MP multiplier (Fig. 1). The 
role of this input interface unit (Fig. 4) is to distribute the 
input data between the nine independent processing elements 
(PEs) (Fig. 2) of the 32 × 32 bit MP multiplier, considering 
the selected operation mode. The input interface unit uses 
an extra MSB sign bit to enable both signed and unsigned multiplications.
Fig. 3. Conceptual view of optimization spaces of MP, DVS, and PP approaches.
A 3-bit control bus indicates whether the
inputs are 1/4/9 pair(s) of 8-bit operands, or 1/2/3 pair(s) of 
16-bit operands, or 1 pair of 32-bit operands, respectively. 
Depending on the selected operating mode, the input data 
stream is distributed (Fig. 4) between the PEs to perform 
the computation. Fig. 5 shows how three 8 × 8 bit PEs are 
used to realize a 16 × 16 bit multiplier. The 32 × 32 bit 
multiplier is constructed using a similar approach but requires 
3 × 3 PEs. A 3-bit control word defines which PEs work 
concurrently and which PEs are disabled. Whenever the full 
precision (32 × 32 bit) is not exercised, the supply voltage 
and the clock frequency may be scaled down according to the 
actual workload. 
To evaluate the overhead associated with reconfigurability and MP, we define X and Y as the 2n-bit wide multiplicand and multiplier, respectively. XH and YH are their respective n most significant bits, whereas XL and YL are their respective n least significant bits. XL YL, XH YL, XL YH, and XH YH are the crosswise products. The product of X and Y can be expressed as follows:

P = XH YH · 2^(2n) + (XH YL + XL YH) · 2^n + XL YL    (1)

where a 2n-bit reconfigurable multiplier can be built using adders and four n bit × n bit multipliers to compute XH YH, XH YL, XL YH, and XL YL. Table I shows that this would result in overheads of 18% and 13% for the silicon area and power, respectively. However, if we define [18]

X' = XH + XL    (2)
Y' = YH + YL    (3)

then (1) can be rewritten as follows:

P = XH YH · 2^(2n) + (X'Y' − XH YH − XL YL) · 2^n + XL YL.    (4)
Comparing (1) and (4), we have removed one n × n bit multiplier (for calculating XH YL or XL YH) and one 2n-bit adder (for calculating XH YL + XL YH).
Fig. 4. Structure of input interface unit. 
Fig. 5. Three PEs combined to form 16 × 16 bit multiplier. 
The two adders are replaced with two n-bit adders (for calculating XH + XL and YH + YL) and two (2n + 2)-bit subtractors (for calculating X'Y' − XH YH − XL YL). In a 32-bit multiplier, we can thus significantly reduce the design complexity by using two 34-bit subtractors to replace a 16 × 16 bit multiplier. We actually need two 16 × 16 bit multipliers (for calculating XH YH and XL YL) and one 17 × 17 bit multiplier (for calculating X'Y').
To evaluate the proposed MP architecture, a conventional 32-bit fixed-width multiplier and four sub-block MP multipliers are designed using a Booth radix-4 Wallace tree structure similar to that used for the building blocks of our MP three sub-block multiplier. These multipliers are synthesized using the Synopsys Design Compiler with an AMIS 0.35-μm complementary metal-oxide-semiconductor (CMOS) standard cell technology library. The power simulations are performed at a clock frequency of 50 MHz and at a power supply of 3.3 V.
Table I shows the implementation results including silicon area 
and power consumption for these multipliers. The proposed 
MP three sub-block architecture can achieve reductions of 
about 16% in power and 28% in area as compared with the 
conventional 32 × 32 bit fixed-width multiplier design.
TABLE I
AREA AND POWER COMPARISON OF PROPOSED MP MULTIPLIERS AGAINST CONVENTIONAL FIXED-WIDTH MULTIPLIER RUNNING AT 50 MHz

Scheme                              Power (mW)      Area (mm2)
32-bit fixed-width multiplier       39.62 (100%)    0.624 (100%)
32-bit 4 sub-block MP multiplier    44.76 (113%)    0.736 (118%)
32-bit 3 sub-block MP multiplier    33.36 (84%)     0.448 (72%)
The latter uses a Booth radix-4 Wallace tree structure similar to that
used in designing the building blocks of our MP multipliers. 
However, because of its larger size, the 32 × 32 bit fixed-width 
multiplier exhibits an irregular layout with complex 
interconnects. This limitation of tree multipliers happens to be 
addressed by our MP 32 × 32 bit multiplier, which uses a more 
regular design to partition, regroup, and sum partial products. 
IV. DYNAMIC VOLTAGE AND FREQUENCY 
SCALING MANAGEMENT 
A. DVS Unit 
In our implementation (Fig. 1), a dynamic power supply and 
a VCO are employed to achieve real-time dynamic voltage and 
frequency scaling under various operating conditions. In [28], it is shown that near-optimal dynamic voltage scaling can be achieved using voltage dithering, which exhibits a faster response time than conventional voltage regulators. Voltage dithering uses power switches to connect different supply voltages to the load, depending on the time slots. An intermediate average voltage is thereby achieved. This conventional voltage dithering technique has some limitations. If the power switches
are toggled with overlapping periods, switches can be turned 
on simultaneously, giving rise to a large transient current. 
To mitigate this, nonoverlapping clocks could be used to 
control power switches. However, this may result in system 
instability as there are instances where all supply voltages 
are disconnected from the load. The requirement for multiple 
supplies can also result in system overhead. To address these 
issues, we implemented a single-supply voltage dithering
Fig. 6. (a) Proposed single-header voltage dithering unit and voltage and 
frequency tuning loops. (b) Experimental timing results from voltage dithering 
unit. 
scheme [Fig. 6(a)], which operates as follows. When the supply voltage (Vn) of the multiplier drops below the predefined
reference voltage (Vref), the comparator output (Va) toggles. 
Therefore, the VFMU turns on the power switch via Vctrl, 
for a predefined duration Tc = 5 μs. The chosen value for the 
off-chip storage capacitor Cs is 4.7 μF. This value is chosen to 
achieve a voltage ripple magnitude of 50 mV [Fig. 6(b)] with 
a charging current set to 50 mA, so as to limit the resistive power loss of the dithering unit to less than 1% of the total
power consumption. The value of Cs is a tradeoff between 
ripple magnitude, tracking speed, and area/power overheads. 
Fig. 6(b) shows experimental results for the voltage control 
loop. 
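A rough behavioral model of this single-switch dithering loop is sketched below. Tc, Cs, and the 50 mA charge current follow the values quoted above; the time step and the constant load current are assumptions made only for illustration.

# Behavioral sketch of the single-switch voltage dithering loop of Fig. 6(a).
# The load current and time step are illustrative assumptions.
T_STEP = 0.1e-6     # simulation time step (s), assumed
T_CHARGE = 5e-6     # switch on-time Tc (s)
C_S = 4.7e-6        # off-chip storage capacitor Cs (F)
I_CHARGE = 50e-3    # charging current when the switch is on (A)
I_LOAD = 10e-3      # multiplier load current (A), assumed constant

def dither(v_ref, v_start, duration):
    v, t, t_on_left, trace = v_start, 0.0, 0.0, []
    while t < duration:
        if t_on_left <= 0 and v < v_ref:   # comparator output Va toggles
            t_on_left = T_CHARGE           # VFMU closes the switch via Vctrl for Tc
        i = (I_CHARGE - I_LOAD) if t_on_left > 0 else -I_LOAD
        v += i * T_STEP / C_S              # dV = I * dt / Cs
        t_on_left -= T_STEP
        t += T_STEP
        trace.append(v)
    return trace                           # ripple is roughly I*Tc/Cs, about 40 mV here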
B. Dynamic Frequency Scaling Unit 
In the proposed 32 × 32 bit MP multiplier, dynamic 
frequency tuning is used to meet throughput requirements. 
It is based on a VCO implemented as a seven-stage current 
starved ring oscillator. The VCO output frequency can be 
tuned from 5 to 50 MHz using four control bits (5 MHz/step). 
This frequency range is selected to meet the requirements 
of general purpose DSP applications. The reported multiplier 
can operate as a 32-bit multiplier or as nine independent 
8-bit multipliers. For the chosen 5–50 MHz operating range, 
our multiplier boasts up to 9 × 50 = 450 MIPS. The 
simulated power consumption for the VCO ranges from 
Fig. 7. Experimental measurement of worst case frequency switching 
(from 50 to 5 MHz). 
Fig. 8. Conceptual view of razor flip-flop [25]. 
85 μW (at 5 MHz) to 149 μW (at 50 MHz), which is negligible compared with the power consumed by the multiplier. Fig. 7 shows
experimental measurements showing the transient response for 
the worst case frequency switching (from 50 to 5 MHz). Clock 
frequency can settle within one clock cycle as required. 
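The frequency-scaling interface can be summarized by the small helper below. The control-word encoding is an assumption (only the 5 MHz/step, 5–50 MHz range is given above); the peak-MIPS figure simply restates the 9 × 50 calculation.

# Illustrative FSU/VCO control mapping and peak throughput, assuming a
# linear encoding of the 4-bit control word (only ten codes are used).
def vco_frequency_mhz(ctrl_word):
    assert 0 <= ctrl_word <= 9, "codes 0-9 cover 5-50 MHz in 5 MHz steps"
    return 5 + 5 * ctrl_word          # 0 -> 5 MHz, ..., 9 -> 50 MHz

def peak_mips(freq_mhz, mode):
    parallel_mults = {"8-bit": 9, "16-bit": 3, "32-bit": 1}[mode]
    return parallel_mults * freq_mhz  # e.g., 9 x 50 = 450 MIPS in 8-bit mode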
V. IMPLEMENTATION OF RAZOR FLIP-FLOPS 
Although the worst case paths are very rarely exercised, traditional DVS approaches still maintain relatively large safety margins to ensure reliable circuit operation, resulting in excessive power dissipation. The razor technique is a breakthrough that largely eliminates these safety margins by providing variation tolerance through in-situ timing error detection and correction [25]. This approach is based on a razor
flip-flop, which detects and corrects delay errors by double 
sampling. The razor flip-flop (Fig. 8) operates as a standard positive edge-triggered flip-flop coupled with a shadow latch, which samples on the negative edge. The input data is therefore given the duration of the positive clock phase to settle to its correct state before being sampled by the shadow latch. The minimum allowable supply voltage needs to be set such that the shadow latch (Fig. 8) always clocks the correct data even under worst case conditions. This requirement is
usually satisfied given that the shadow latch is clocked later 
than the main flip-flop. A comparator flags a timing error 
when it detects a discrepancy between the speculative data 
sampled at the main flip-flop and the correct data sampled
Fig. 9. Microphotograph of 32 × 32 bit MP multiplier. 
at the shadow latch. The correct data would subsequently 
overwrite the incorrect signal. The key idea behind razor flip-flops 
is that if an error is detected at a given pipeline stage X, 
then computations are only re-executed through the following 
pipeline stage X + 1. This is possible because the correct 
sampled value would be held by the shadow latch [25]. This 
approach ensures forward progress of data through the entire 
pipeline at the cost of a single-clock cycle [25]. 
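The double-sampling mechanism can be captured by a minimal behavioral model, sketched below under obvious simplifications (single bit of data, no metastability): the main flip-flop samples on the rising edge, the shadow latch samples the settled value on the falling edge, and any mismatch raises the error flag and restores the shadow value.

# Behavioral, single-bit model of a razor flip-flop's double sampling.
class RazorFF:
    def __init__(self):
        self.main = 0      # speculative value, clocked on the rising edge
        self.shadow = 0    # late, assumed-correct value from the shadow latch

    def rising_edge(self, d):
        self.main = d
        return self.main

    def falling_edge(self, d_settled):
        self.shadow = d_settled
        error = self.shadow != self.main
        if error:                      # mismatch: restore the correct value;
            self.main = self.shadow    # the pipeline stalls for one cycle
        return error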
An error correction mechanism, based on global clock 
gating, is implemented in the proposed multiplier [25]. In this 
correction scheme, error and clock signals are used to determine when the entire pipeline needs to be stalled for a single clock cycle. Fig. 1 shows that a global error signal is fed
to the VFMU so as to alert the controlling unit whenever 
the current operating voltage is lower than necessary. The 
VFMU will then increase the voltage reference. This will in 
turn result in the VSU generating a new supply voltage level 
based on the new target voltage reference. When an error 
occurs, results can be recomputed at any pipeline stage using 
the corresponding input of the shadow latch. Therefore, the 
correct values can be forwarded to the corresponding next 
stages. Given that all stages can carry out these recomputations 
in parallel, the adopted global clock gating can tolerate any 
number of errors within a given clock cycle [25]. After one 
clock cycle, normal pipeline operation can resume. The actual 
implementation of razor flip-flops requires careful design to 
meet timing constraints and avoid system failure. For example, 
the use of a delayed clock for the shadow latch (Fig. 8) makes 
it possible for a short-path in the combinational logic to corrupt 
the data in the shadow latch [25]. This imposes a short-path 
delay constraint at the input of each razor flip-flop of our 
multiplier. To meet these constraints across all corners, we 
inserted delay buffers through all short paths found by Cadence system-on-chip (SOC) Encounter and validated them through PrimeTime. In addition, precautions are taken to mitigate
metastability by inserting a metastability detector at the output 
of each main flip-flop. The outputs of the metastability detector 
and the error comparator (Fig. 8) are ORed to generate the error signal of individual razor flip-flops [25], [26]. These razor error signals are ORed together to form a global
error signal used to ensure that all valid data in the shadow 
latches is restored into the main flip-flops before the next clock cycle. The adopted design for the metastability detector is that proposed in [26]. This metastability detector relies on skewed inverters, which require careful simulation through all process corners to ensure proper operation [26].

TABLE II
PROTOTYPE CHARACTERISTICS

Technology node:                            0.35 μm
Die size:                                   1.5 × 1.0 mm
Total number of transistors:                37656
Measured chip power at 3.3 V:               39 mW
DVS supply voltage range:                   0–3.3 V
DFS clock frequency range:                  5–50 MHz
Total number of flip-flops:                 144
Number of razor flip-flops:                 13
Standard D flip-flop power:                 57 μW
Razor flip-flop power (static/switching):   70/239 μW
Total power overhead of razor flip-flops:   2.3%
When implementing razor-based DVS, it is essential that the 
resulting power/delay overhead be kept to a minimum, so as not to severely limit the benefits brought by aggressive supply voltage scaling. In the case of our multiplier, only 13 out of a total of 144 flip-flops, that is, 9% of the flip-flops, are found not to meet timing constraints under the worst case supply voltage level (Table II). Therefore, only these 13 critical paths are
equipped with razor flip-flops. These 13 near-critical paths 
are identified through Cadence SOC Encounter and validated 
using PrimeTime. At a supply voltage of 3.3 V and operating
frequency of 50 MHz, the razor flip-flop is found to consume 1.2 times the power of a standard flip-flop (70 versus 57 μW) when no timing errors are detected, and 4.2 times the standard flip-flop power (239 versus 57 μW) when errors occur. However, for
a conservative activity factor of 1%, the power overhead due 
to razor flip-flops was estimated to be less than 2.3% of the 
nominal chip power because only 9% of the flip-flops were 
made razor flip-flops. Therefore, both the silicon area and 
power overheads associated with razor flip-flops are found to
be negligible. The razor flip-flop's delay overhead is mainly due to the additional multiplexer at its input as well as the increased fan-out resulting from the introduction of the comparator, metastability detector, and OR gates at the output.
At a supply voltage of 3.3 V and operating frequency of 
50 MHz, delay overheads are found to be 1.20% and 3.58% for 
error-free and error-occurring cases, respectively. These delay 
overheads constitute a small penalty for the massive power 
reduction enabled by razor-based DVS. 
VI. PERFORMANCE EVALUATION AND DISCUSSION 
We designed and fabricated a 32 × 32 bit reconfigurable 
multiplier in AMIS 0.35-μm technology. The die photograph 
of the multiplier system is shown in Fig. 9 and the chip 
characteristics are shown in Table II. The operating mode
Fig. 10. Experimental results of minimum voltage supply for different 
precisions and operating frequencies. 
of the multiplier is controlled by three external signals. 
The operating voltage and frequency are tuned automatically 
depending on the actual workload of the multiplier. The chip 
is tested by feeding in randomly generated operands and 
comparing the outputs with results from a PC processing the 
same data. The 32-bit precision data sets include data with 
an effective word-length of 17–32 bits. The 16-bit precision 
data sets and 8-bit precision data sets include data with an 
effective word-length of 9–16 bits and 0–8 bits, respectively. 
We achieved full functionality across a voltage range of 
0.8–3.3 V, and a frequency range of 5–50 MHz. Fig. 10 
shows the relation between the minimum supply voltage 
and operating frequency for different precision modes. As 
explained, razor energy savings are a result of the elimination 
of safety margins and processing below the first failure voltage. 
By scaling the voltage below the first failure point, an error 
rate of 0.1% is maintained and the power consumption is 
measured at this minimum possible voltage. For an operating 
frequency of 50 MHz, the supply voltage is set to 2.45, 1.95, 
and 1.80 V for the 32, 16, and 8-bit modes, respectively. 
For lower operating frequencies, the required supply voltage 
levels are much lower, as shown in Fig. 10. The chip power 
consumption for different operating modes is shown in Fig. 11. 
For 16-bit operands, 55.6% (17.35 versus 39.04 mW) power 
reduction can be obtained by the MP scheme. When the 
DVS technique is applied, the chip consumes 6.06 mW at 
the first failure point at an optimal 0.1% error rate, leading 
to a further 65.1% (6.06 versus 17.35 mW) power saving. 
With the PP feature enabled, the operating frequency can be scaled down to one third of the original, and the voltage can therefore be tuned to a much lower level for an additional 46.7% (3.23 versus 6.06 mW) power reduction. For 8-bit
operands, the MP, DVS, and PP schemes can help save 87.4% 
(4.90 versus 39.04 mW), 70.2% (1.46 versus 4.90 mW), and 
55.5% (0.65 versus 1.46 mW) power, respectively. 
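As a quick sanity check, the percentages quoted above follow directly from the measured figures; the snippet below reproduces the 16-bit-mode numbers.

# Reproducing the quoted 16-bit-mode savings from the measured powers (mW).
p_32_fixed = 39.04   # 32-bit fixed mode, no MP/DVS/PP
p_16_mp = 17.35      # 16-bit MP mode
p_16_dvs = 6.06      # 16-bit MP + razor DVS
p_16_pp = 3.23       # 16-bit MP + DVS + PP

saving = lambda new, old: round(100 * (1 - new / old), 1)
print(saving(p_16_mp, p_32_fixed))   # 55.6 % from MP alone
print(saving(p_16_dvs, p_16_mp))     # 65.1 % from adding razor-based DVS
print(saving(p_16_pp, p_16_dvs))     # 46.7 % from adding PP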
Fig. 12 shows experimental results for the power savings associated with the MP, razor-based DVS, and PP features of the fabricated 32 × 32 bit multiplier.
Fig. 11. Experimental results of power consumption of different operating 
schemes. 
Fig. 12. Experimental data showing power optimization spaces associated 
to MP, razor-based DVS, and PP schemes. 
Region 1 is the power
optimization space for MP whereas Regions 2(a) and (b) are 
the power optimization spaces for the DVS technique without 
and with razor, respectively. Finally, region 3 is the power 
optimization space for PP. Fig. 12 shows that when MP is combined with DVS, power consumption is reduced to 29.12, 8.07, and 1.98 mW (labeled points in Fig. 12) for 32-, 16-, and 8-bit multiplications, respectively. In addition, razor flip-flops help reduce the operating voltage to the minimum possible level, resulting in a further power reduction of 26.1% (from 29.12 to 21.52 mW), 24.9% (from 8.07 to 6.06 mW), and 26.3% (from 1.98 to 1.46 mW) for 32-, 16-, and 8-bit precision, respectively.
Based on PP, the power reduction space is further enlarged. 
Table III compares the performance of the fabricated prototype 
with related works. References [7], [20], and [21] correspond to FP, fixed-voltage schemes, whereas [5], [22], and [23] are MP, fixed-voltage schemes, and [9] and [10] are MP, multivoltage schemes. To compare the silicon area associated with each scheme, we chose to use the number of
transistors because it constitutes a fair metric to compare different CMOS technology nodes. From Table III, the proposed multiplier provides the most reconfigurability while exhibiting the smallest relative area. Compared with the designs
with the same maximum word-length of 32 bits [21], [23], our design boasts a much smaller area. The maximum word-length of designs [7] and [5] is 16 bits instead of 32 bits. If we assume that a 32 × 32 bit multiplier is built using three 16 × 16 bit multipliers, then the area of the 32-bit
using three 16 × 16 bit multipliers, then the area of the 32-bit 
multiplier is at least three times that of the 16-bit multiplier, 
discarding the glue and reconfigurability logic. This shows that 
the proposed multiplier outperforms reported implementations 
whether considering silicon area or reconfigurability. In regard 
to power dissipation, Table III shows normalized power results (using P = C V^2 f α0→1) to cater for the different technologies previously reported. As in previous works, we consider
random input test patterns, with activity factors determined 
using models describing the propagation of the input statistics 
to the output of data-path operators [29]. Normalized power 
results show that the proposed multiplier outperforms reported 
implementations in terms of power dissipation. In previous 
works, flexibility and reconfigurability have come at a cost of 
increased silicon area and power consumption. In this paper, we propose an implementation that not only provides an MP reconfigurable datapath, but also achieves a reduction in both silicon area and power, as compared with FP multipliers. In more advanced deep submicrometer processes, the proposed MP multiplier with razor DVS offers the ability to compensate for process variations. It would also be essential to integrate leakage reduction techniques [30], so as to jointly minimize leakage and dynamic power consumption.
VII. INPUT OPERANDS SCHEDULER 
A. Motivation and Operation Principle 
In the previous section, we reported experimental results obtained using different data sets, each composed of randomly
generated single-precision operands. However, in some 
applications such as artificial neural network applications, the 
input data stream could include mixed-precision operands [1]. 
Although our multiplier provides three different precision 
modes (32 × 32 bit, 16 × 16 bit, 8 × 8 bit), the supply 
voltage would still have to transit dynamically between the 
minimum required voltage levels Vmin32, Vmin16, or Vmin8 
required for 32, 16, and 8-bit operands, respectively. Fig. 10 
shows that given a certain operating frequency, the difference among Vmin32, Vmin16, and Vmin8 can be in the range of 0.1–0.65 V. If the input data stream requires frequent
supply voltage transitions, significant dynamic power would 
be dissipated, thereby undermining the benefits of DVS. 
In addition, these transitions may not always be possible within 
one clock cycle. To minimize the overall power consumption, 
one needs to reduce the number of supply voltage transitions 
while still processing operands at the minimum required 
voltage level. To address this problem, we propose an IOS 
that will perform the following tasks: 1) reorder the input 
data stream such that same-precision operands are grouped 
together into a buffer (Fig. 13) and 2) find the minimum supply 
voltages (Vmin32, Vmin16, Vmin8), and operating frequencies 
( f32, f16, f8) for the three different-precision data groups to 
minimize the overall power consumption while still meeting 
the specified throughput. 
The block diagram of the IOS is shown in Fig. 13. It is 
composed of an operand range detector, a pattern generation engine, a 2-kbit buffer (RAM), and a frequency/voltage analyzer. The scheduler operates as follows. The input operands
are first sent to the range detector, which classifies them 
according to their precision: 32, 16, or 8-bit. The classified 
data is then grouped by the pattern generation engine, which 
packs same-precision data into three different 32-bit data 
patterns (Fig. 13): 1) pattern 1 corresponds to an original 32-bit input operand; 2) pattern 2 combines two 16-bit operands (with their redundant 16 MSBs removed); and 3) pattern 3 combines four 8-bit operands (with their redundant 24 MSBs removed). At each clock cycle, a 32-bit data pattern
can be processed, owing to the PP capability of the proposed multiplier. This resembles a SIMD structure and puts the MP and PP capabilities to full use. As shown in Fig. 13, the
three different data patterns are counted (N32, N16, and N8) 
and stored into a buffer, together with the respective voltages 
and clock frequencies at which they should be processed. For 
each full buffer, only two supply voltage transitions are needed: (Vmin8, f8)–(Vmin16, f16) and (Vmin16, f16)–(Vmin32, f32). To limit the silicon area overhead, we chose a 2-kbit RAM, which can store 60 32-bit data patterns. The voltage/frequency
analyzer specifies the values of Vmin32, Vmin16, Vmin8, f32, f16, 
and f8 to the dithering unit and VCO. The Vmin– f pairs are 
determined during the characterization of the chip and stored 
in the LUT (Fig. 13). 
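The front end of the IOS can be sketched in a few lines, as shown below. The unsigned-magnitude range detection and the simple ceiling-division packing are simplifying assumptions, not the hardware implementation.

# Sketch of the IOS front end: classify operand pairs by effective
# word-length and count the resulting 32-bit data patterns.
def effective_precision(x, y):
    """Return 8, 16, or 32 for an (unsigned) operand pair."""
    m = max(x.bit_length(), y.bit_length())
    return 8 if m <= 8 else 16 if m <= 16 else 32

def pack_patterns(operand_pairs):
    groups = {8: [], 16: [], 32: []}
    for x, y in operand_pairs:
        groups[effective_precision(x, y)].append((x, y))
    n32 = len(groups[32])               # pattern 1: one 32-bit operand per word
    n16 = -(-len(groups[16]) // 2)      # pattern 2: two 16-bit operands per word
    n8 = -(-len(groups[8]) // 4)        # pattern 3: four 8-bit operands per word
    return groups, (n32, n16, n8)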
B. Problem Formulation 
Given a random mixed-precision (32-, 16-, or 8-bit) input 
data stream and specified throughput Tp, our goal is to 
determine the voltages (Vmin32, Vmin16, Vmin8) and frequencies 
( f32, f16 and f8) at which each precision data group should be 
processed such that the total power consumption is minimized. 
In the following analysis, we consider the following four 
components of the total power consumption: 1) the resistive power loss Pdith_resistic_loss of the dithering unit; 2) the switching power loss Pdith_switching_loss of the dithering unit; 3) the dynamic power consumption Pcomputation associated with the multiplication computation; and 4) Pcompu_overhead, which corresponds to the power consumed when the computation is carried out at voltage levels higher than the nominal Vmin. The equations of these four components are given below:
Pdith_resistic_loss = Ichar^2 Ron    (5)

where Ichar is the charge current of the dithering unit and Ron is the equivalent resistance of the dithering switch;

Pdith_switching_loss = Cg Vdd^2 f / N    (6)

where Cg is the gate capacitance of the dithering switch, Vdd is the 3.3 V standard voltage, and N is the number of input data patterns.
TABLE III 
PERFORMANCE COMPARISON OF PROPOSED MULTIPLIERWITH RELATED WORKS 
Fig. 13. Block diagram of IOS. 
Pcomputation = Cm Vmin^2 f    (7)

where Cm is the effective capacitance of the multiplier, Vmin is the applied minimum supply voltage, and f is the applied operating frequency;

Pcompu_overhead = (∫ P dt) / T = (∫ Cm V^2 f dV) / T    (8)

where V is the dithering unit output voltage, which fluctuates around Vmin, and T is the charge time period, which is inversely proportional to the operating frequency.
The overall power consumption is thus given by

Poverall = Pcomputation + Pcompu_overhead + Pdith_resistic_loss + Pdith_switching_loss.    (9)
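A simplified evaluation of (5)–(9) is sketched below. The overhead term (8) is approximated here by its time average, i.e., Cm f (⟨V^2⟩ − Vmin^2); all argument values are placeholders to be taken from the chip characterization, not measured constants.

# Sketch of the power budget (5)-(9), with (8) approximated by a time
# average of the dithered supply. All inputs are example placeholders.
def overall_power(i_char, r_on, c_g, v_dd, n_patterns, c_m, v_min, f, v_sq_avg):
    p_resistive = i_char ** 2 * r_on                  # (5)
    p_switching = c_g * v_dd ** 2 * f / n_patterns    # (6)
    p_compute = c_m * v_min ** 2 * f                  # (7)
    p_overhead = c_m * f * (v_sq_avg - v_min ** 2)    # (8), time-averaged approximation
    return p_compute + p_overhead + p_resistive + p_switching   # (9)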
In the following, we present three different algorithms 
to reduce this overall power consumption. Each of these 
algorithms constitutes a different approach to process the 
mixed-precision data held in the operand buffer (Fig. 13). The performance of each algorithm is evaluated using a mixed-precision data set of 120 000 randomly generated operands, with a third corresponding to each precision (8-, 16-, and 32-bit).
Fig. 14. Operation principles of operand scheduling algorithms A, B, and C. Data Block X and Data Block X+1 refer to two consecutive operand data blocks subsequently stored into the RAM.
TABLE IV
DETAILED POWER PERFORMANCE OF DIFFERENT SCHEDULING ALGORITHMS

Algorithm   P_computation   P_compu_overhead   P_dith_resistic_loss   P_dith_switching_loss   P_overall
A           3.034 mW        3.159 mW           0.059 mW               1.715 mW                8.255 mW
B           2.266 mW        2.565 mW           0.084 mW               1.663 mW                6.578 mW
C           1.682 mW        1.843 mW           0.062 mW               0.975 mW                4.561 mW
In the following, the specified throughput Tp for the proposed 32 × 32 bit multiplier is 64 F (Mbits/s), where F is the multiplier's operating frequency.
C. Algorithm A 
In the first algorithm, the multiplier throughput Tp = 64 F is kept constant by fixing the operating frequencies (f32, f16, or f8) of each precision data group (32-, 16-, or 8-bit) to

f32 = F,  f16 = F/2,  f8 = F/4    (10)
where F is the multiplier’s operating frequency. This is 
because the throughput in 8 × 8 bit multiplication mode is four 
times that of the 32 × 32 bit multiplication mode and double 
that of the 16 × 16 bit multiplication mode, as a result of the 
multiplier PP. The minimum supply voltage (Vmin32, Vmin16 
or Vmin8) associated with each operating frequency (f32, f16
or f8) is determined through a Vmin– f LUT. Algorithm A 
shows its limitations when 32-bit operands are processed 
initially. As shown in Fig. 14, once all N32 operands of the data 
block are processed, the supply voltage (Vn) needs to decrease 
rapidly from point A (Vmin32) to point B (Vmin16) at which all 
N16 16-bit operands of the data block should be processed. 
If N16 is too small, most 16-bit operands will actually be processed in segment A–B, that is, at a voltage possibly much higher than the minimal Vmin16 level. Similarly, 8-bit operands of the data block could be processed in segments C–D, B–C, or even A–B in the worst case. This contributes to increasing Pcompu_overhead. The overall power performance of
algorithm A is shown in Table IV. Compared with the fixed-width 
32 × 32 bit standard multiplier (32 × 32 bit mode 
must be chosen given that a third of operands are 32-bit), 
77.7% total power reduction is achieved with a total silicon 
area overhead of only 11.1%, when considering DVS, razor, 
RAM, and dedicated circuitry for scheduling algorithm A. 
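Algorithm A reduces to a table lookup once (10) is fixed, as the sketch below illustrates. Only the 50 MHz voltage points quoted in Section VI are measured; the other Vmin–f entries are illustrative placeholders for the on-chip LUT.

# Sketch of algorithm A: fixed per-precision frequencies per (10) plus a
# Vmin-f lookup. LUT entries below 50 MHz are illustrative placeholders.
VMIN_F_LUT = {
    32: {50: 2.45, 25: 1.60, 12.5: 1.20},
    16: {50: 1.95, 25: 1.40, 12.5: 1.10},
    8:  {50: 1.80, 25: 1.30, 12.5: 1.00},
}

def algorithm_a(F=50):
    freqs = {32: F, 16: F / 2, 8: F / 4}   # keeps throughput Tp = 64 F constant
    volts = {p: VMIN_F_LUT[p][freqs[p]] for p in (32, 16, 8)}
    return freqs, volts                    # e.g., F = 50 MHz -> 2.45, 1.40, 1.00 V (placeholders)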
D. Algorithm B 
This algorithm removes all transitions of the power supply 
voltage by making Vmin32, Vmin16, and Vmin8 equal and adjusting f32, f16, and f8 such that the overall throughput is kept unchanged. We thus need to have the following:
(64 N32 + 128 N16 + 256 N8) / (N32/f32 + N16/f16 + N8/f8) = 64 F.    (11)
From a LUT, we can obtain the Vmin– f relationship as 
follows: 
Vmin32 = ψ32(f32)    (12)
Vmin16 = ψ16(f16)    (13)
Vmin8 = ψ8(f8).    (14)

As algorithm B keeps the supply voltage constant,

ψ32(f32) = ψ16(f16) = ψ8(f8) = V    (15)
the operating frequencies f32, f16, and f8 can be determined 
by using (11) and (15). For example, when F is set to 
50 MHz, the values for V , f32, f16, and f8 are found to 
be 1.35 V, 20 MHz, 25 MHz, and 35 MHz, respectively. 
The overall power consumption of algorithm B is shown in 
Table IV. Due to the complete removal of voltage transitions, Pcompu_overhead is reduced. At the same time, because the frequencies are planned holistically, the dynamic computation power is also optimized to a lower level. Compared with the fixed-width
32 × 32 bit standard multiplier, 81.5% power reduction is 
achieved with a total silicon area overhead of only 11.9%, 
when considering DVS, razor, RAM, and dedicated circuitry 
for scheduling Algorithm B. 
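Algorithm B amounts to searching for the lowest single voltage whose per-precision maximum frequencies still satisfy (11). A sketch is given below; psi_inv stands for the inverse of the ψ curves in (12)–(14), i.e., the maximum frequency each precision sustains at a given supply, taken from the chip LUT. The voltage grid and the curves themselves are assumptions of this sketch.

# Sketch of algorithm B: one shared supply voltage, frequencies from the
# inverse Vmin-f curves, checked against the throughput constraint (11).
def throughput_ok(f32, f16, f8, n32, n16, n8, F):
    bits = 64 * n32 + 128 * n16 + 256 * n8
    time = n32 / f32 + n16 / f16 + n8 / f8
    return bits / time >= 64 * F                     # constraint (11)

def algorithm_b(psi_inv, n32, n16, n8, F, v_grid):
    """psi_inv[p](v) -> max frequency of precision p at supply v."""
    for v in sorted(v_grid):                         # try the lowest voltage first
        f32, f16, f8 = (psi_inv[p](v) for p in (32, 16, 8))
        if throughput_ok(f32, f16, f8, n32, n16, n8, F):
            return v, (f32, f16, f8)
    raise ValueError("throughput not reachable on the given voltage grid")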
E. Algorithm C 
Although algorithm B removes power supply voltage transitions by setting a single voltage level V, there may be better power-saving combinations of power supply voltages and operating frequencies: (Vmin32, f32), (Vmin16, f16), and (Vmin8, f8). The aim of algorithm C is to find such an optimum
for reduced power consumption. To limit complexity, we will 
only seek to minimize the dynamic power dissipated as a result 
of the computation 
P = C V^2 f    (16)
  = Cm32 Vmin32^2 f32 + Cm16 Vmin16^2 f16 + Cm8 Vmin8^2 f8    (17)
  = χ(f32, f16).    (18)
Given that the Vmin– f relationships are known (12)–(14), 
one could find the minimum of the above equation for the 
specified throughput (11). For example, when F is set to 
50 MHz, the values for (Vmin32, f32), (Vmin16, f16), (Vmin8, f8) 
are found to be (1.15 V, 15 MHz), (1.30 V, 20 MHz), and 
(1.75 V, 45 MHz), respectively. The overall power performance of algorithm C is shown in Table IV. When considering DVS, razor, RAM, and dedicated scheduling circuitry, algorithm C exhibits the least power consumption, with an overall power reduction of 86.3% compared with the standard 32 × 32 bit fixed-width multiplier. However, it requires two additional dithering units to generate all three discrete power supply levels Vmin32, Vmin16, and Vmin8 and thus remove transitions among these different supply levels. This increases the total silicon area overhead to 27.1%. Therefore, algorithm B provides the most attractive tradeoff, with an 81.5% power reduction and a silicon area overhead of just 11.9%.
VIII. CONCLUSION 
We proposed a novel MP multiplier architecture featuring, respectively, a 28.2% and 15.8% reduction in silicon area and power consumption compared with its 32 × 32 bit conventional fixed-width multiplier counterpart. When integrating this MP multiplier architecture with an error-tolerant razor-based DVS approach and the proposed novel operand scheduler, a 77.7%–86.3% total power reduction was achieved with a total silicon area overhead as low as 11.1%. The fabricated chip demonstrated run-time adaptation to the actual workload by operating at the minimum supply voltage level and minimum clock frequency while meeting throughput requirements. The proposed dedicated operand scheduler rearranges operations on input operands so as to reduce the number of supply voltage transitions and, in turn, minimize the overall power consumption of the multiplier. The proposed MP razor-based DVS multiplier provides a solution toward achieving full computational flexibility and low power consumption for various general purpose low-power applications.
ACKNOWLEDGMENT 
The authors would like to thank Dr. M. K. Law for his comments and discussions. They would also like to acknowledge Mr. S. F. Luk for his help with the chip test measurements.
REFERENCES 
[1] R. Min, M. Bhardwaj, S.-H. Cho, N. Ickes, E. Shih, A. Sinha, A. Wang, 
and A. Chandrakasan, “Energy-centric enabling technologies for wireless sensor networks,” IEEE Wirel. Commun., vol. 9, no. 4, pp. 28–39,
Aug. 2002. 
[2] M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001. 
[3] A. Wang and A. Chandrakasan, “Energy-aware architectures for a real-valued 
FFT implementation,” in Proc. IEEE Int. Symp. Low Power 
Electron. Design, Aug. 2003, pp. 360–365. 
[4] T. Kuroda, “Low power CMOS digital design for multimedia processors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367.
[5] H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. 
IEEE Int. SOC Conf., Sep. 2004, pp. 123–126. 
[6] S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable 
booth multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, 
no. 3, pp. 568–580, Mar. 2010. 
[7] O. A. Pfander, R. Hacker, and H.-J. Pfleiderer, “A multiplexer-based 
concept for reconfigurable multiplier arrays,” in Proc. Int. Conf. Field 
Program. Logic Appl., vol. 3203. Sep. 2004, pp. 938–942. 
[8] F. Carbognani, F. Buergin, N. Felber, H. Kaeslin, and W. Fichtner, 
“Transmission gates combined with level-restoring CMOS gates reduce 
glitches in low-power low-frequency multipliers,” IEEE Trans. Very 
Large Scale Integr. (VLSI) Syst., vol. 16, no. 7, pp. 830–836, Jul. 2008. 
[9] T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by 
data-driven voltage variation,” in Proc. IEEE Int. Symp. Circuits Syst., 
May 2004, pp. 285–288. 
[10] W. Ling and Y. Savaria, “Variable-precision multiplier for equalizer with 
adaptive modulation,” in Proc. 47th Midwest Symp. Circuits Syst., vol. 1. 
Jul. 2004, pp. I-553–I-556. 
[11] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “A micropower low-voltage 
multiplier with reduced spurious switching,” IEEE Trans. Very Large 
Scale Integr. (VLSI) Syst., vol. 13, no. 2, pp. 255–265, Feb. 2005. 
[12] M. Sjalander, M. Drazdziulis, P. Larsson-Edefors, and H. Eriksson, “A low-leakage twin-precision multiplier using reconfigurable
power gating,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005, 
pp. 1654–1657. 
[13] S.-R. Kuang and J.-P. Wang, “Design of power-efficient pipelined 
truncated multipliers with various output precision,” IET Comput. Digital 
Tech., vol. 1, no. 2, pp. 129–136, Mar. 2007. 
[14] J. L. Holt and J.-N. Hwang, “Finite precision error analysis of neural 
network hardware implementations,” IEEE Trans. Comput., vol. 42, 
no. 3, pp. 281–290, Mar. 1993. 
[15] A. Bermak, D. Martinez, and J.-L. Noullet, “High-density 16/8/4-bit 
configurable multiplier,” Proc. Inst. Electr. Eng. Circuits Devices Syst., 
vol. 144, no. 5, pp. 272–276, Oct. 1997. 
[16] T. Kuroda, “Low power CMOS digital design for multimedia processors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367.
[17] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, 
“A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State 
Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000. 
[18] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, 
A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and 
T. Furuyama, “Variable supply-voltage scheme for low-power high-speed 
CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, no. 3, 
pp. 454–462, Mar. 1998.
[19] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, 
A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura, “Dynamic 
voltage and frequency management for a low-power embedded microprocessor,”
IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28–35, 
Jan. 2005. 
[20] J.-Y. Kang and J.-L. Gaudiot, “A simple high-speed multiplier design,” IEEE Trans. Comput., vol. 55, no. 10, pp. 1253–1258,
Oct. 2006. 
[21] G. Y. Jeong, J. S. Park, and H. C. Kang, “A Study on multiplier 
architecture optimized for 32-bit processor with 3-stage pipeline,” in 
Proc. Int. SoC Design Conf., Oct. 2004, pp. 656–660. 
[22] S. Perri, P. Corsonello, M. A. Iachino, M. Lanuzza, and G. Cocorullo, 
“Variable precision arithmetic circuits for FPGA-based multimedia 
processors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, 
no. 9, pp. 995–999, Sep. 2004. 
[23] S. D. Haynes, A. Ferrari, and P. Y. K. Cheung, “Flexible reconfigurable 
multiplier blocks suitable for enhancing the architecture of FPGAs,” in 
Proc. IEEE Custom Integr. Circuits, May 1999, pp. 191–194. 
[24] S. Das, D. Blaauw, D. Bull, K. Flautner, and R. Aitken, “Addressing 
design margins through error-tolerant circuits,” in Proc. Design Autom. 
Conf., Jul. 2009, pp. 11–12. 
[25] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, 
D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A low-power 
pipeline based on circuit-level timing speculation,” in Proc. Int. Symp. 
Microarchit., Dec. 2003, pp. 7–18. 
[26] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, T. Mudge, and 
K. Flautner, “A self-tuning DVS processor using delay-error detection 
and correction,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 792–804, 
Apr. 2006. 
[27] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, 
D. M. Bull, and D. T. Blaauw, “RazorII: In situ error detection and 
correction for PVT and SER tolerance,” IEEE J. Solid-State Circuits, 
vol. 44, no. 1, pp. 32–48, Jan. 2009. 
[28] B. Calhoun and A. Chandrakasan, “Ultra-dynamic voltage scaling using 
sub-threshold operation and local voltage dithering in 90 nm CMOS,” 
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2005, 
pp. 300–301. 
[29] E. D. Kyriakis-Bitzaros and S. Nikolaidis, “Estimation of bit-level transition activity in datapaths based on word-level statistics and conditional
entropy,” IEE Proc. Circuits, Devices Syst., vol. 149, no. 4, pp. 234–240, 
Aug. 2002. 
[30] A. Youssef, M. Anis, and M. Elmasry, “A comparative study between 
static and dynamic sleep signal generation techniques for leakage 
tolerant designs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 
vol. 16, no. 9, pp. 1114–1126, Sep. 2008. 
Xiaoxiao Zhang (S’06) received the B.S. degree 
from the Department of Microelectronics, Tianjin 
University, Tianjin, China, and the M.E. degree from 
the Institute of Microelectronics, Chinese Academy 
of Sciences, Beijing, China, in 2003 and 2006, 
respectively. She is currently pursuing the Ph.D. 
degree with the Electronic and Computer Engineering Department, Hong Kong University of Science
and Technology, Hong Kong. Her Ph.D. research 
work involves the design of low-power real-time 
digital image processing (DIP) cores or modules for 
a camera-on-a-chip. 
Her current research interests include low-power and high-performance 
VLSI circuits design, signal processing architectures, face detection, and 3-D 
object/face recognition. 
Farid Boussaid (M’00–SM’04) received the M.S. 
and Ph.D. degrees in microelectronics from the 
National Institute of Applied Science (INSA), 
Toulouse, France, in 1996 and 1999, respectively. 
He joined Edith Cowan University, Perth, Australia, as a Postdoctoral Research Fellow and
a member of the Visual Information Processing 
Research Group in 2000. He joined the University 
of Western Australia, Crawley, Australia, in 2005, 
where he is currently an Associate Professor. 
His current research interests include smart CMOS 
vision sensors, gas sensors, neuromorphic systems, device simulation, modeling, and characterization in deep submicron CMOS processes.
Amine Bermak (M’99–SM’04–F’13) received the 
M.Eng. and Ph.D. degrees in electronic engineering 
from Paul Sabatier University, Toulouse, France, in 
1994 and 1998, respectively. 
He joined the Advanced Computer Architecture 
Research Group, York University, York, U.K., where 
he worked as a Post-Doctoral Fellow on the 
VLSI implementation of CMM neural networks for 
vision applications in a project funded by British 
Aerospace. He joined Edith Cowan University, 
Perth, Australia, in 1998, first as a Research Fellow 
working on smart vision sensors, then as a Lecturer and a Senior Lecturer. 
He is currently a Professor with the Electronic and Computer Engineering 
Department, Hong Kong University of Science and Technology (HKUST), 
Hong Kong. His current research interests include VLSI circuits and systems 
for signal, image processing, sensors, and microsystems applications. 
Dr. Bermak was a recipient of many distinguished awards, including the 
2004 “IEEE Chester Sall Award,” the HKUST “Engineering School Teaching 
Excellence Award” in 2004 and 2009, and the “Best Paper Award” at the 2005 
International Workshop on System-On-Chip for Real-Time Applications.

  • 1. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 759 32 Bit×32 Bit Multiprecision Razor-Based Dynamic Voltage Scaling Multiplier With Operands Scheduler Xiaoxiao Zhang, Student Member, IEEE, Farid Boussaid, Senior Member, IEEE, and Amine Bermak, Fellow, IEEE Abstract—In this paper, we present a multiprecision (MP) reconfigurable multiplier that incorporates variable precision, parallel processing (PP), razor-based dynamic voltage scaling (DVS), and dedicated MP operands scheduling to provide opti-mum performance for a variety of operating conditions. All of the building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to perform higher-precision multiplications. Given the user’s requirements (e.g., throughput), a dynamic volt-age/ frequency scaling management unit configures the multiplier to operate at the proper precision and frequency. Adapting to the run-time workload of the targeted application, razor flip-flops together with a dithering voltage unit then configure the multiplier to achieve the lowest power consumption. The single-switch dithering voltage unit and razor flip-flops help to reduce the voltage safety margins and overhead typically associated to DVS to the lowest level. The large silicon area and power overhead typically associated to reconfigurability features are removed. Finally, the proposed novel MP multiplier can further benefit from an operands scheduler that rearranges the input data, hence to determine the optimum voltage and frequency operating conditions for minimum power consumption. This low-power MP multiplier is fabricated in AMIS 0.35-μm technology. Experimental results show that the proposed MP design features a 28.2% and 15.8% reduction in circuit area and power consumption compared with conventional fixed-width multiplier. When combining this MP design with error-tolerant razor-based DVS, PP, and the proposed novel operands scheduler, 77.7%–86.3% total power reduction is achieved with a total silicon area overhead as low as 11.1%. This paper successfully demonstrates that a MP architecture can allow more aggressive frequency/supply voltage scaling for improved power efficiency. Index Terms—Computer arithmetic, dynamic voltage scaling, low power design, multi-precision multiplier. I. INTRODUCTION CONSUMERS demand for increasingly portable yet high-performance multimedia and communication products imposes stringent constraints on the power consumption of individual internal components [1]–[4]. Of these, multipliers perform one of the most frequently encountered arithmetic Manuscript received June 8, 2012; revised February 11, 2013; accepted February 20, 2013. Date of publication April 18, 2013; date of current version March 18, 2014. This work was supported in part by a grant from the HK Research Grant Council, under Grant 610509 and the Australian Research Council’s Discovery Projects Funding Scheme under Grant DP130104374. X. Zhang and A. Bermak are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: zhangxx@ust.hk; eebermak@ust.hk). F. Boussaid is with the School of Electrical, Electronic, and Computer Engineering, The University of Western Australia, Perth 6017, Australia (e-mail: farid.boussaid@uwa.edu.au). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. 
Digital Object Identifier 10.1109/TVLSI.2013.2252032 operations in digital signal processors (DSPs) [4]. For embed-ded applications, it has become essential to design more power-aware multipliers [4]–[13]. Given their fairly complex structure and interconnections, multipliers can exhibit a large number of unbalanced paths, resulting in substantial glitch generation and propagation [8], [11]. This spurious switching activity can be mitigated by balancing internal paths through a combination of architectural and transistor-level optimization techniques [8], [11]. In addition to equalizing internal path delays, dynamic power reduction can also be achieved by mon-itoring the effective dynamic range of the input operands so as to disable unused sections of the multiplier [6], [12] and/or truncate the output product at the cost of reduced precision [13]. This is possible because, in most sensor applications, the actual inputs do not always occupy the entire magnitude of its word-length. For example, in artificial neural network applications, the weight precision used during the learning phase is approximately twice that of the retrieval phase [14]. Besides, operations in lower precisions are the most frequently required. In contrast, most of today’s full-custom DSPs and application-specific integrated circuits (ASICs) are designed for a fixed maximum word-length so as to accommodate the worst case scenario. Therefore, an 8-bit multiplication com-puted on a 32-bit Booth multiplier would result in unnecessary switching activity and power loss. Several works investigated this word-length optimization. [1], [2] proposed an ensemble of multipliers of different pre-cisions, with each optimized to cater for a particular scenario. Each pair of incoming operands is routed to the smallest multiplier that can compute the result to take advantage of the lower energy consumption of the smaller circuit. This ensemble of point systems is reported to consume the least power but this came at the cost of increased chip area given the used ensemble structure. To address this issue, [3], [5] proposed to share and reuse some functional modules within the ensemble. In [3], an 8-bit multiplier is reused for the 16-bit multiplication, adding scalability without large area penalty. Reference [5] extended this method by implementing pipelining to further improve the multiplier’s performance. A more flexible approach is proposed in [15], with several mul-tiplier elements grouped together to provide higher precisions and reconfigurability. Reference [7] analyzed the overhead associated to such reconfigurable multipliers. This analysis showed that around 10%–20% of extra chip area is needed for 8–16 bits multipliers. Combining multiprecision (MP) with dynamic voltage scal-ing (DVS) can provide a dramatic reduction in power con- 1063-8210 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
  • 2. 760 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 sumption by adjusting the supply voltage according to circuit’s run-time workload rather than fixing it to cater for the worst case scenario [4]. When adjusting the voltage, the actual performance of the multiplier running under scaled voltage has to be characterized to guarantee a fail-safe operation. Conventional DVS techniques consist mainly of lookup table (LUT) and on-chip critical path replica approaches [17]–[19]. The LUT approach tunes the supply voltage according to a predefined voltage-frequency relationship stored in a LUT, which is formed considering worst case conditions (process variations, power supply droops, temperature hot-spots, cou-pling noise, and many more). Therefore, large margins are necessarily added, which in turn significantly decrease the effectiveness of the DVS technique. The critical path replica approach typically involves an on-chip critical path replica to approximate the actual critical path. Therefore, voltage could be scaled to the extent that the replica fails to meet the timing. However, safety margins are still needed to compensate for the intradie delay mismatch and address fast-changing transient effects [24]. In addition, the critical path may change as a result of the varying supply voltage or process or tempera-ture variations. If this occurs, computations will completely fail regardless of the safety margins. The aforementioned limitations of conventional DVS techniques motivated recent research efforts into error-tolerant DVS approaches [24]–[27], which can run-time operate the circuit even at a voltage level at which timing errors occur. A recovery mechanism is then applied to detect error occurrences and restore the correct data. Because it completely removes worst case safety margins, error-tolerant DVS techniques can further aggressively reduce power consumption. In this paper, we propose a low power reconfigurable multiplier architecture that combines MP with an error-tolerant DVS approach based on razor flip-flops [25]. The main contributions of this paper can be summarized follows. 1) A novel MP multiplier architecture featuring, respectively, 28.2% and 15.8% reduction in silicon area and power consumption compared with its conventional 32 × 32 bit fixed-width multiplier counterpart. All reported multipliers trade silicon area/power consumption for MP [7]. In this paper, silicon area is optimized by applying an operation reduction technique that replaces a multiplier by adders/subtractors. 2) A silicon implementation of this MP multiplier integrating an error-tolerant razor-based dynamic DVS approach. The fabricated chip demonstrates run-time adaptation to the actual workload by operating at the minimum supply voltage level and minimum clock frequency while meeting throughput requirements. Prior works combining MP with DVS have only considered a limited number of offline simulated precision-voltage pairs, with unnecessary large safety margins added to cater for critical paths [9], [10]. 3) A novel dedicated operand scheduler that rearranges operations on input operands so as to reduce the number of transitions of the supply voltage and, in turn, minimize the overall power consumption of the multiplier. 
Unlike reported scheduling works, the Performance request Input data flow Voltage and Frequency Management Unit (VFMU) Input Operands Scheduler (IOS) Target voltage Target clock frequency reference System-on-chip Voltage Scaling Unit (VSU) Clock Frequency Scaling Unit (FSU/VCO) FPGA Multi-precision Multiplier Scheduled data flow Supply voltage Operating Clock Multiplication results Error feedback reference Fig. 1. Overall multiplier system architecture. function of the proposed scheduler is not task scheduling rather input operands scheduling for the proposed MP multiplier. The rest of this paper is organized as follows. Section II presents the operation and architecture of the proposed MP multiplier. Section III presents the approach used to reduce the overhead associated to MP and reconfigurability. Section IV presents the operating principle and implementation of the DVS management unit. Section V presents the razor flip-flops, which are at the heart of the DVS flow. Section VI presents experimental results. Section VII presents the operands sched-uler unit. Finally, a conclusion is given in Section VIII. II. SYSTEM OVERVIEW AND OPERATION The proposed MP multiplier system (Fig. 1) comprises five different modules that are as follows: 1) the MP multiplier; 2) the input operands scheduler (IOS) whose function is to reorder the input data stream into a buffer, hence to reduce the required power supply voltage transitions; 3) the frequency scaling unit implemented using a voltage controlled oscillator (VCO). Its function is to generate the required operating frequency of the multiplier; 4) the voltage scaling unit (VSU) implemented using a volt-age dithering technique to limit silicon area overhead. Its function is to dynamically generate the supply voltage so as to minimize power consumption; 5) the dynamic voltage/frequency management unit (VFMU) that receives the user requirements (e.g., throughput). The VFMU sends control signals to the VSU and FSU to generate the required power supply voltage and clock frequency for the MP multiplier. The MP multiplier is responsible for all computations. It is equipped with razor flip-flops that can report timing
  • 3. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 761 Fig. 2. Possible configuration modes of proposed MP multiplier. errors associated to insufficiently high voltage supply levels. The operation principle is as follows. Initially, the multiplier operates at a standard supply voltage of 3.3 V. If the razor flip-flops of the multiplier do not report any errors, this means that the supply voltage can be reduced. This is achieved through the VFMU, which sends control signals to the VSU, hence to lower the supply voltage level. When the feedback provided by the razor flip-flops indicates timing errors, the scaling of the power supply is stopped. The proposed multiplier (Fig. 2) not only combines MP and DVS but also parallel processing (PP). Our multiplier comprises 8 × 8 bit reconfigurable multipliers. These building blocks can either work as nine independent multipliers or work in parallel to perform one, two or three 16 × 16 bit multiplications or a single-32 × 32 bit operation. PP can be used to increase the throughput or reduce the supply voltage level for low power operation. Fig. 3 shows the benefits of the different approaches being considered. Power consumption is a linear function of the workload, which is normally represented by the input operands precision. Curve 1 corresponds to the case of a fixed-precision (FP) multiplier using a fixed power supply. Region 1 shows the power optimization space for MP techniques, which use different-precision multiplications to reduce power. If one combines MP with DVS, power is further reduced with curves (1)–(3) becoming curves (4)–(6), respectively. Regions 1 and 2 show the power optimization space for the combined approach. Based on PP, the operating frequency could be decreased together with the supply voltage, as shown in curves (7) and (8). Finally, region 3 shows the optimization space for the proposed approach, which combines MP, DVS with PP. III. MP AND RECONFIGURABILITY OVERHEAD Fig. 4 shows the structure of the input interface unit, which is a submodule of the MP multiplier (Fig. 1). The role of this input interface unit (Fig. 4) is to distribute the input data between the nine independent processing elements (PEs) (Fig. 2) of the 32 × 32 bit MP multiplier, considering the selected operation mode. The input interface unit uses an extra MSB sign bit to enable both signed and unsigned Fig. 3. Conceptual view of optimization spaces of MP, DVS, and PP approaches. multiplications. A 3-bit control bus indicates whether the inputs are 1/4/9 pair(s) of 8-bit operands, or 1/2/3 pair(s) of 16-bit operands, or 1 pair of 32-bit operands, respectively. Depending on the selected operating mode, the input data stream is distributed (Fig. 4) between the PEs to perform the computation. Fig. 5 shows how three 8 × 8 bit PEs are used to realize a 16 × 16 bit multiplier. The 32 × 32 bit multiplier is constructed using a similar approach but requires 3 × 3 PEs. A 3-bit control word defines which PEs work concurrently and which PEs are disabled. Whenever the full precision (32 × 32 bit) is not exercised, the supply voltage and the clock frequency may be scaled down according to the actual workload. To evaluate the overhead associated to reconfigurability and MP, we define X and Y as the 2n-bits wide multiplicand and multiplier, respectively. XH, YH are their respective n most significant bits whereas XL, YL are their respective n least significant bits. XLYL , XHYL , XLYH, XHYH is the crosswise products. 
The product of X and Y can be expressed as follows: P = (XHYH )22n + (XHYL + XLYH)2n + XLYL (1) where 2n-bit reconfigurable multiplier can be built using adders and four n bit × n bit multipliers to compute XHYH, XHYL , XLYH, and XLYL . Table I shows that this would result in overheads of 18% and 13% for the silicon area and power, respectively. However, if we define [18] X = XH + XL (2) Y = YH + YL (3) then (1) could be rewritten as follows: P =(XHYH)22n+(XY −XHYH−XLYL )2n+XLYL . (4) Comparing (1) and (4), we have removed one n × n bit multiplier (for calculating XHYL or XLYH ) and one 2n-bit adder (for calculating XHYL + XLYH). The two adders are replaced with two n-bit adders (for calculating XH + XL and
  • 4. 762 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 X16_3//1[15:8] X16_2//1[15:8] X16_1//[15:8] X8_9//2 X8_4//2 0 X32_1//[15:8] MUX X16_3//1[7:0] X16_2//1[7:0] Y16_1//[7:0] X8_9//1 X8_4//1 X8_1// X32_1//[7:0] MUX X32_1//[31:24] X16_3//2[15:8] X16_2//2[15:8] X8_9//4 X8_4//4 0 MUX X32_1//[23:16] X16_3//2[7:0] X16_2//2[7:0] X8_9//3 X8_4//3 0 MUX X16_3//3[15:8] X8_9//6 0 MUX X8_9//5 0 MUX X8_9//8 0 MUX X8_9//7 0 MUX X16_3//3[7:0] X8_9//9 0 X MUX Y16_3//1[15:8] Y16_2//1[15:8] Y16_1//[15:8] Y8_9//2 Y8_4//2 0 Y32_1//[15:8] MUX Y16_3//1[7:0] Y16_2//1[7:0] Y16_1//[7:0] Y8_9//1 Y8_4//1 Y8_1// Y32_1//[7:0] MUX Y32_1//[31:24] Y16_3//2[15:8] Y16_2//2[15:8] Y8_9//4 Y8_4//4 0 MUX Y32_1//[23:16] Y16_3//2[7:0] Y16_2//2[7:0] Y8_9//3 Y8_4//3 0 MUX Y16_3//3[15:8] Y8_9//6 0 MUX Y8_9//5 0 MUX Y8_9//8 0 MUX Y8_9//7 0 MUX Y16_3//3[7:0] Y8_9//9 0 Y MUX 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 PE9 PE8 PE7 PE6 PE5 PE4 PE3 PE2 PE1 3-bit mode control Fig. 4. Structure of input interface unit. Fig. 5. Three PEs combined to form 16 × 16 bit multiplier. YH + YL) and two (2n + 2)-bit subtractors (for calculating XY − XHYH − XLYL ). In a 32-bit multiplier, we can thus significantly reduce the design complexity by using two 34-bit subtractors to replace a 16 × 16 bit multiplier. We actually need two 16 × 16 bit multipliers (for calculating XHYH and XLYL ) and one 17 × 17 bit multiplier (for calculating XY ). To evaluate the proposed MP architecture, a conventional 32-bit fixed-width multiplier and four sub-block MP mul-tipliers are designed using a Booth Radix-4 Wallace tree structure similar to that used for the building blocks of our MP three sub-block multiplier. These multipliers are synthesized using the synopsys design compiler with AMIS 0.35-μm complimentary metal-oxide-semiconductor (CMOS) standard cell technology library. The power simulations are performed at a clock frequency of 50 MHz and at a power supply of 3.3 V. Table I shows the implementation results including silicon area and power consumption for these multipliers. The proposed MP three sub-block architecture can achieve reductions of about 16% in power and 28% in area as compared with the conventional 32 × 32 bit fixed-width multiplier design. The TABLE I AREA AND POWER COMPARISON OF PROPOSED MP MULTIPLIERS AGAINST CONVENTIONAL FIXED-WIDTHMULTIPLIER RUNNING AT 50 MHz Schemes Power (mW) Area (mm2) 32-bit 39.62 0.624 fixed-width multiplier (100%) (100%) 32-bit 4 sub-block 44.76 0.736 MP multiplier (113%) (118%) 32-bit 3 sub-block 33.36 0.448 MP multiplier (84%) (72%) latter uses a Booth radix-4Wallace tree structure similar to that used in designing the building blocks of our MP multipliers. However, because of its larger size, the 32 × 32 bit fixed-width multiplier exhibits an irregular layout with complex interconnects. This limitation of tree multipliers happens to be addressed by our MP 32 × 32 bit multiplier, which uses a more regular design to partition, regroup, and sum partial products. IV. DYNAMIC VOLTAGE AND FREQUENCY SCALING MANAGEMENT A. DVS Unit In our implementation (Fig. 1), a dynamic power supply and a VCO are employed to achieve real-time dynamic voltage and frequency scaling under various operating conditions. In [28], near-optimal dynamic voltage scaling can be achieved when using voltage dithering, which exhibits faster response time than conventional voltage regulators. Voltage dithering uses power switches to connect different supply voltages to the load, depending on the time slots. 
Therefore, an intermediate average voltage is achieved. This conventional voltage dither-ing technique has some limitations. If the power switches are toggled with overlapping periods, switches can be turned on simultaneously, giving rise to a large transient current. To mitigate this, nonoverlapping clocks could be used to control power switches. However, this may result in system instability as there are instances where all supply voltages are disconnected from the load. The requirement for multiple supplies can also result in system overhead. To address these issues, we implemented a single-supply voltage dithering
  • 5. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 763 (a) (b) Fig. 6. (a) Proposed single-header voltage dithering unit and voltage and frequency tuning loops. (b) Experimental timing results from voltage dithering unit. scheme [Fig. 6(a)], which operates as follows. When the sup-ply voltage (Vn) of the multiplier drops below the predefined reference voltage (Vref), the comparator output (Va) toggles. Therefore, the VFMU turns on the power switch via Vctrl, for a predefined duration Tc = 5 μs. The chosen value for the off-chip storage capacitor Cs is 4.7 μF. This value is chosen to achieve a voltage ripple magnitude of 50 mV [Fig. 6(b)] with a charging current set to 50 mA, hence to limit the resistive power loss of the dithering unit to less than 1% of the total power consumption. The value of Cs is a tradeoff between ripple magnitude, tracking speed, and area/power overheads. Fig. 6(b) shows experimental results for the voltage control loop. B. Dynamic Frequency Scaling Unit In the proposed 32 × 32 bit MP multiplier, dynamic frequency tuning is used to meet throughput requirements. It is based on a VCO implemented as a seven-stage current starved ring oscillator. The VCO output frequency can be tuned from 5 to 50 MHz using four control bits (5 MHz/step). This frequency range is selected to meet the requirements of general purpose DSP applications. The reported multiplier can operate as a 32-bit multiplier or as nine independent 8-bit multipliers. For the chosen 5–50 MHz operating range, our multiplier boasts up to 9 × 50 = 450 MIPS. The simulated power consumption for the VCO ranges from Fig. 7. Experimental measurement of worst case frequency switching (from 50 to 5 MHz). Fig. 8. Conceptual view of razor flip-flop [25]. 85 (5 MHz) to 149 μW (50 MHz), which is negligible com-pared with the power consumed by the multiplier. Fig. 7 shows experimental measurements showing the transient response for the worst case frequency switching (from 50 to 5 MHz). Clock frequency can settle within one clock cycle as required. V. IMPLEMENTATION OF RAZOR FLIP-FLOPS Although the worst case paths are very rarely exercised, tra-ditional DVS approaches still maintain relatively large safety margins to ensure reliable circuit operation, resulting in exces-sive power dissipated. The razor technology is a breakthrough work, which largely eliminates the safety margins by achieving variable tolerance through in-situ timing error detection and correction ability [25]. This approach is based on a razor flip-flop, which detects and corrects delay errors by double sampling. The razor flip-flop (Fig. 8) operates as a standard positive edge triggered flip-flops coupled with a shadow latch, which samples at the negative edge. Therefore, the input data is given in the duration of the positive clock phase to settle down to its correct state before being sampled by the shadow latch. The minimum allowable supply voltage needs to be set, hence the shadow latch (Fig. 8) always clocks the correct data even for the worst case conditions. This requirement is usually satisfied given that the shadow latch is clocked later than the main flip-flop. A comparator flags a timing error when it detects a discrepancy between the speculative data sampled at the main flip-flop and the correct data sampled
  • 6. 764 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 V C O Voltage Scaling Unit Multiplier Razor flip-flops Fig. 9. Microphotograph of 32 × 32 bit MP multiplier. at the shadow latch. The correct data would subsequently overwrite the incorrect signal. The key idea behind razor flip-flops is that if an error is detected at a given pipeline stage X, then computations are only re-executed through the following pipeline stage X + 1. This is possible because the correct sampled value would be held by the shadow latch [25]. This approach ensures forward progress of data through the entire pipeline at the cost of a single-clock cycle [25]. An error correction mechanism, based on global clock gating, is implemented in the proposed multiplier [25]. In this correction scheme, error and clock signals are used to deter-mine when the entire pipeline needs to be stalled for a single-clock cycle. Fig. 1 shows that a global error signal is fed to the VFMU so as to alert the controlling unit whenever the current operating voltage is lower than necessary. The VFMU will then increase the voltage reference. This will in turn result in the VSU generating a new supply voltage level based on the new target voltage reference. When an error occurs, results can be recomputed at any pipeline stage using the corresponding input of the shadow latch. Therefore, the correct values can be forwarded to the corresponding next stages. Given that all stages can carry out these recomputations in parallel, the adopted global clock gating can tolerate any number of errors within a given clock cycle [25]. After one clock cycle, normal pipeline operation can resume. The actual implementation of razor flip-flops requires careful design to meet timing constraints and avoid system failure. For example, the use of a delayed clock for the shadow latch (Fig. 8) makes it possible for a short-path in the combinational logic to corrupt the data in the shadow latch [25]. This imposes a short-path delay constraint at the input of each razor flip-flop of our multiplier. To meet these constraints across all corners, we inserted delay buffers through all short paths found by Cadence silicon-on-chip (SOC) Encounter and validated them through Prime Time. In addition, precautions are used to mitigate metastability by inserting a metastability detector at the output of each main flip-flop. The outputs of the metastability detector and the error comparator (Fig. 8) are ORED to generate the error signal of individual razor flip-flops [25], [26]. These razor error signals are OR-ED together to form a global error signal used to ensure that all valid data in the shadow TABLE II PROTOTYPE CHARACTERISTICS Technology node 0.35 μm Die size 1.5 × 1.0 mm Total number of transistors 37656 Measured chip power at 3.3 V 39 mW DVS supply voltage range 0–3.3 V DFS clock frequency range 5–50 MHz Total number of flip-flops 144 Number of razor flip-flops 13 Standard D flip-flop power 57 μW Razor flip-flop power (static/switching) 70/239 μW Total power overhead of razor flip-flops 2.3% latches is restored into the main flip-flops before the next clock cycle. The adopted design for the metastability detector is that proposed in [26]. This metastability detector relies on skewed inverters, which require careful simulation through all process corners to ensure proper operation [26]. 
When implementing razor-based DVS, it is essential that the resulting power/delay overhead be kept to a minimum, hence not to severely limit the benefits brought by aggressive supply voltage scaling. In the case of our multiplier, only 13 out of a total of 144 flip-flops that is 9% of the flip-flops are found not to meet timing constraints under worst case level of the supply voltage (Table II). Therefore, only these 13 critical paths are equipped with razor flip-flops. These 13 near-critical paths are identified through Cadence SOC Encounter and validated using Prime Time. At a supply voltage of 3.3 V and operating frequency of 50 MHz, the razor flip-flop is found to consume 1.2 times more static/switching power (70/57 μW) when no timing errors are detected. In the other case, it consumes 4.2 times more static/switching power (239/57 μW). However, for a conservative activity factor of 1%, the power overhead due to razor flip-flops was estimated to be less than 2.3% of the nominal chip power because only 9% of the flip-flops were made razor flip-flops. Therefore, both the silicon area and power overheads associated to razor flip-flops are found to be negligible. In regard to the razor flip-flop’s delay overhead, it is mainly because of the additional multiplexer at its input as well as the increased fan-out resulting from the introduction of comparator, metastability detector, and OR gates at the output. At a supply voltage of 3.3 V and operating frequency of 50 MHz, delay overheads are found to be 1.20% and 3.58% for error-free and error-occurring cases, respectively. These delay overheads constitute a small penalty for the massive power reduction enabled by razor-based DVS. VI. PERFORMANCE EVALUATION AND DISCUSSION We designed and fabricated a 32 × 32 bit reconfigurable multiplier in AMIS 0.35-μm technology. The die photograph of the multiplier system is shown in Fig. 9 and the chip characteristics are shown in Table II. The operating mode
  • 7. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 765 32 bit mode 16 bit mode 8 bit mode 5 10 15 20 25 30 35 40 45 50 2.5 2.0 1.5 1.0 Minimum Voltage (V) Frequency (MHz) Fig. 10. Experimental results of minimum voltage supply for different precisions and operating frequencies. of the multiplier is controlled by three external signals. The operating voltage and frequency are tuned automatically depending on the actual workload of the multiplier. The chip is tested by feeding in randomly generated operands and comparing the outputs with results from a PC processing the same data. The 32-bit precision data sets include data with an effective word-length of 17–32 bits. The 16-bit precision data sets and 8-bit precision data sets include data with an effective word-length of 9–16 bits and 0–8 bits, respectively. We achieved full functionality across a voltage range of 0.8–3.3 V, and a frequency range of 5–50 MHz. Fig. 10 shows the relation between the minimum supply voltage and operating frequency for different precision modes. As explained, razor energy savings are a result of the elimination of safety margins and processing below the first failure voltage. By scaling the voltage below the first failure point, an error rate of 0.1% is maintained and the power consumption is measured at this minimum possible voltage. For an operating frequency of 50 MHz, the supply voltage is set to 2.45, 1.95, and 1.80 V for the 32, 16, and 8-bit modes, respectively. For lower operating frequencies, the required supply voltage levels are much lower, as shown in Fig. 10. The chip power consumption for different operating modes is shown in Fig. 11. For 16-bit operands, 55.6% (17.35 versus 39.04 mW) power reduction can be obtained by the MP scheme. When the DVS technique is applied, the chip consumes 6.06 mW at the first failure point at an optimal 0.1% error rate, leading to a further 65.1% (6.06 versus 17.35 mW) power saving. Based on PP feature enabled, the operating frequency can be scaled to 1/3 of the original one, therefore the voltage would be tuned down to a much lower level for an additional 46.7% (3.23 versus 6.06 mW) power reduction. For 8-bits operands, the MP, DVS, and PP schemes can help save 87.4% (4.90 versus 39.04 mW), 70.2% (1.46 versus 4.90 mW), and 55.5% (0.65 versus 1.46 mW) power, respectively. Fig. 12 shows experimental results showing the power sav-ings associated to the MP, razor-based DVS, and PP features 40 35 30 25 20 15 10 5 0 8-b mode with DVS and PP 16-b mode without DVS 8-b mode with DVS 8-b mode without DVS 16-b mode with DVS and PP 16-b mode with DVS 32-b mode Without MP nor DVS nor PP MP with DVS with PP MP MP with DVS Power Consumption (mW) Fig. 11. Experimental results of power consumption of different operating schemes. Fig. 12. Experimental data showing power optimization spaces associated to MP, razor-based DVS, and PP schemes. of the fabricated 32 × 32 bit multiplier. Region 1 is the power optimization space for MP whereas Regions 2(a) and (b) are the power optimization spaces for the DVS technique without and with razor, respectively. Finally, region 3 is the power optimization space for PP. Fig. 12 shows that when MP is combined with DVS, power consumption is reduced to 29.12, 8.07, and 1.98 mW (points , , and in Fig. 12) for 32, 16, and 8-bit multiplications, respectively. 
In addition, razor flip-flops help reduce the operating voltage to the minimum possible level, resulting in a further power reduction of 26.1% (from 29.12 to 21.52 mW, point to , 24.9% (from 8.07 to 6.06 mW, point to , and 26.3% (from 1.98 to 1.46 mW, point to ) for 32, 16, and 8-bit precision, respectively. Based on PP, the power reduction space is further enlarged. Table III compares the performance of the fabricated prototype with related works. [7], [20], [21] correspond to FP voltage schemes whereas [5], [22], [23] are MP, fixed voltage schemes. [9], [10] are MP, multivoltage schemes. To compare the silicon area associated to each scheme, we chose to use the number of
  • 8. 766 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 transistors because it constitutes a fair metric to compare dif-ferent CMOS technology nodes. From Table III, the proposed multiplier provides the most reconfigurability while exhibit-ing the smallest relative area. Compared with the designs with the same maximum word-length of 32-bit [21], [23], our design boasts a much smaller area. For design [7] and design [5], their maximum word-length is 16-bit instead of 32-bit. If we assume that a 32 × 32 bit multiplier is built using three 16 × 16 bit multipliers, then the area of the 32-bit multiplier is at least three times that of the 16-bit multiplier, discarding the glue and reconfigurability logic. This shows that the proposed multiplier outperforms reported implementations whether considering silicon area or reconfigurability. In regard to power dissipation, Table III shows normalized power results (using P = CV2 f α0−1) to cater for the different technolo-gies previously reported. As in previous works, we consider random input test patterns, with activity factors determined using models describing the propagation of the input statistics to the output of data-path operators [29]. Normalized power results show that the proposed multiplier outperforms reported implementations in terms of power dissipation. In previous works, flexibility and reconfigurability have come at a cost of increased silicon area and power consumption. In this paper, we propose an implementation that not only provides MP reconfigurable datapath, but also obtains a reduction in both silicon area and power, as compared with FP multipliers. In more advanced deep submicrometer processes, the proposed MP multiplier with razor DVS offers the ability to compensate for process variations. It would also be essential to integrate leakage reduction techniques [30], hence to jointly minimize leakage and dynamic power consumption. VII. INPUT OPERANDS SCHEDULER A. Motivation and Operation Principle In previous section, we report experimental results obtained using different data sets, each composed of randomly generated single-precision operands. However, in some applications such as artificial neural network applications, the input data stream could include mixed-precision operands [1]. Although our multiplier provides three different precision modes (32 × 32 bit, 16 × 16 bit, 8 × 8 bit), the supply voltage would still have to transit dynamically between the minimum required voltage levels Vmin32, Vmin16, or Vmin8 required for 32, 16, and 8-bit operands, respectively. Fig. 10 shows that given a certain operating frequency, the differ-ence among Vmin32, Vmin16, and Vmin8 can be in the range of 0.1–0.65 V. If the input data stream requires frequent supply voltage transitions, significant dynamic power would be dissipated, thereby undermining the benefits of DVS. In addition, these transitions may not always be possible within one clock cycle. To minimize the overall power consumption, one needs to reduce the number of supply voltage transitions while still processing operands at the minimum required voltage level. To address this problem, we propose an IOS that will perform the following tasks: 1) reorder the input data stream such that same-precision operands are grouped together into a buffer (Fig. 
13) and 2) find the minimum supply voltages (Vmin32, Vmin16, Vmin8), and operating frequencies ( f32, f16, f8) for the three different-precision data groups to minimize the overall power consumption while still meeting the specified throughput. The block diagram of the IOS is shown in Fig. 13. It is composed of an operand range detector, a pattern generation engine, a 2 k-bit buffer-(RAM), and a frequency/voltage ana-lyzer. The scheduler operates as follows. The inputs operands are first sent to the range detector, which classifies them according to their precision: 32, 16, or 8-bit. The classified data is then grouped by the pattern generation engine, which packs same-precision data into three different 32-bit data patterns (Fig. 13): 1) pattern 1 corresponds to original 32-bit input operand Data; 2) pattern 2 combines two 16-bit operands data (with their redundant 16 MSBs removed); and 3) pattern 3 combines four 8-bit operand data (with their redundant 24 MSBs removed). At each clock cycle, a 32-bit data pattern can be processed, owning to the PP capability of the proposed multiplier. This resembles the SIMD structure, and helps to put the MP and PP capability into real effect. As in Fig. 13, the three different data patterns are counted (N32, N16, and N8) and stored into a buffer, together with the respective voltages and clock frequencies at which they should be processed. For each full buffer, there will only be two transitions needed: (Vmin8, f8)–(Vmin16, f16), and (Vmin16, f16)–(Vmin32, f32). To limit the silicon area overhead, we chose a 2k-RAM, which can store 60 32-bit data patterns. The voltage/frequency analyzer specifies the values of Vmin32, Vmin16, Vmin8, f32, f16, and f8 to the dithering unit and VCO. The Vmin– f pairs are determined during the characterization of the chip and stored in the LUT (Fig. 13). B. Problem Formulation Given a random mixed-precision (32-, 16-, or 8-bit) input data stream and specified throughput Tp, our goal is to determine the voltages (Vmin32, Vmin16, Vmin8) and frequencies ( f32, f16 and f8) at which each precision data group should be processed such that the total power consumption is minimized. In the following analysis, we consider the following four components of the total power consumption: 1) the resis-tive power loss Pdith_resistic_loss of the dithering unit; 2) the switching power loss Pdith_switching_loss of the dithering unit; 3) the dynamic power consumption Pcomputation associated to the multiplication computation; and 4) finally, Pcompu_overhead that corresponds to the power consumption of the latter computation when carried out at voltage levels higher than the nominal Vmin. The equations of the aforementioned four components of the total power consumption are given below Pdith_resistic_loss = I 2 char Ron (5) where Ichar is the charge current of the dithering unit, and Ron is the equivalent resistance of the dithering switch 2 f N Pdith_switching_loss = CgVdd (6) where Cg is the gate capacitance of the dithering switch, Vdd is the 3.3 V standard voltage, and N is the number of input
  • 9. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 767 TABLE III PERFORMANCE COMPARISON OF PROPOSED MULTIPLIERWITH RELATED WORKS Pattern 1 32-b ope Pattern 2 Pattern 3 8-b ope 8-b ope 8-b ope 8-b ope Range Detector Pattern Generation Engine Buffer (RAM) Vmin-freqency Look-up-table Voltage Frequency Analyzer Input Operands Stream Dithering Unit VCO Multi-precision Multiplier 32-b ope 32-b ope .. 32-b ope 16-b ope 16-b ope 16-b ope 16-b ope 16-b ope 16-b ope .. 16-b ope 16-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope .. 8-b ope 8-b ope 8-b ope 8-b ope Pattern 1 data Pattern 2 data Pattern 3 data 16-b ope 16-b ope Algorithm A Algorithm B Algorithm C Fig. 13. Block diagram of IOS. data patterns Pcomputation = CmVmin 2 f (7) where Cm is the effective capacitance of the multiplier, Vmin is the applied minimum supply voltage, and f is the applied operating frequency Pcompu_overhead = Pdt T = CmV 2 f dV T (8) where V is the dithering unit output, which fluctuates around Vmin, and T is the charge time period, which is inversely proportional to the operating frequency. The overall power consumption is thus given by Poverall = Pcomputation + Pcompu_overhead +Pdith_resistic_loss + Pdith_switching_loss. (9) In the following, we present three different algorithms to reduce this overall power consumption. Each of these algorithms constitutes a different approach to process the mixed-precision data held in the operands buffer (Fig. 13). The performance of each algorithm is evaluated using a mixed-precision data set of 120 000 randomly operands, with a third corresponding to each precision (8-, 16-, and 32-bit). In the following, the specified throughput Tp for the proposed
  • 10. 768 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 Fig. 14. Operation principles of operand scheduling algorithms A, B, and C. Data Block X and Data Block X+1 refer to two-consecutive operand data blocks subsequently stored into the RAM, respectively. TABLE IV DETAILED POWER PERFORMANCE OF DIFFERENT SCHEDULING ALGORITHMS Algorithm P_computation P_compu_overhead P_dith_resistic_loss P_dith_switching_loss P_overall A 3.034 mW 3.159 mW 0.059 mW 1.715 mW 8.255 mW B 2.266 mW 2.565 mW 0.084 mW 1.663 mW 6.578 mW C 1.682 mW 1.843 mW 0.062 mW 0.975 mW 4.561 mW 32 × 32 bit multiplier is 64 F (Mbits/s), where F is the multiplier’s operating frequency. C. Algorithm A In the first algorithm, the multiplier throughput Tp = 64 F is kept constant by fixing the operating frequencies ( f32−, f16−, or f8) of each precision-data group (32-, 16-, or 8-bit) to f32 = F, f16 = F 2 , f8 = F 4 (10) where F is the multiplier’s operating frequency. This is because the throughput in 8 × 8 bit multiplication mode is four times that of the 32 × 32 bit multiplication mode and double that of the 16 × 16 bit multiplication mode, as a result of the multiplier PP. The minimum supply voltage (Vmin32, Vmin16 or Vmin8) associated to each operating frequency ( f32, f16 or f8) is determined through a Vmin– f LUT. Algorithm A shows its limitations when 32-bit operands are processed initially. As shown in Fig. 14, once all N32 operands of the data block are processed, the supply voltage (Vn) needs to decrease rapidly from point A (Vmin32) to point B (Vmin16) at which all N16 16-bit operands of the data block should be processed. If N16 is too small, most 16-bit operands will be actually processed in Sections A and B, that is at a voltage possibly much higher than the minimal Vmin16 level. Similarly 8-bit operands of the data block could be processed in Sections C and D, B-C, or even A-B for the worst case. This contributes to increasing Pcompu_overhead. The overall power performance of algorithm A is shown in Table IV. Compared with the fixed-width 32 × 32 bit standard multiplier (32 × 32 bit mode must be chosen given that a third of operands are 32-bit), 77.7% total power reduction is achieved with a total silicon area overhead of only 11.1%, when considering DVS, razor, RAM, and dedicated circuitry for scheduling algorithm A. D. Algorithm B This algorithm removes all transitions of the power supply voltage by making Vmin32, Vmin16, and Vmin8 equal and adjust-ing f32, f16, and f8 such that the overall throughput is kept unchanged. We thus need to have the following: 64N32 + 128N16 + 256N8 N32 f32 + N16 f16 + N8 f8 = 64 F. (11) From a LUT, we can obtain the Vmin– f relationship as follows: Vmin32 = ψ32( f32) (12) Vmin16 = ψ16( f16) (13) Vmin8 = ψ8( f8). (14) As algorithm B keeps the supply voltage constant ψ32( f32) = ψ16( f16) = ψ8( f8) = V (15)
  • 11. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 769 the operating frequencies f32, f16, and f8 can be determined by using (11) and (15). For example, when F is set to 50 MHz, the values for V , f32, f16, and f8 are found to be 1.35 V, 20 MHz, 25 MHz, and 35 MHz, respectively. The overall power consumption of algorithm B is shown in Table IV. Due to the complete removal of voltage transitions, the Pcompu_overhead is reduced. Simultaneously, because of holistic planning, the dynamic computation power is also optimized to a lower level. Compared with the fixed-width 32 × 32 bit standard multiplier, 81.5% power reduction is achieved with a total silicon area overhead of only 11.9%, when considering DVS, razor, RAM, and dedicated circuitry for scheduling Algorithm B. E. Algorithm C Although Algorithm B removes power supply voltage tran-sitions by setting a single-voltage level V, there may be better power saving combinations of power supply voltages and operating frequencies: (Vmin32, f32), (Vmin16, f16), and (Vmin8, f8). The aim of algorithm C is to find such an optimum for reduced power consumption. To limit complexity, we will only seek to minimize the dynamic power dissipated as a result of the computation P = CV2 f (16) = Cm32V2 min32 f32 + Cm16V2 min16 f16 + Cm8V 2 min8 f8 (17) = χ( f32, f16). (18) Given that the Vmin– f relationships are known (12)–(14), one could find the minimum of the above equation for the specified throughput (11). For example, when F is set to 50 MHz, the values for (Vmin32, f32), (Vmin16, f16), (Vmin8, f8) are found to be (1.15 V, 15 MHz), (1.30 V, 20 MHz), and (1.75 V, 45 MHz), respectively. The overall power perfor-mance of algorithm C is shown in Table IV. When consid-ering DVS, razor, RAM, and dedicated scheduling circuitry, algorithm B exhibits the least power consumption, with an overall power reduction of 86.3%, compared with the standard 32 × 32 bit fixed-width multiplier. However, it requires two additional dithering units to generate all three discrete power supply levels Vmin32, Vmin16, and Vmin8 and thus remove transitions among these different supply levels. This increases the total silicon area overhead to 27.1%. Therefore, algorithm B provides the most attractive tradeoff with 81.5% reduction and a silicon area overheard of just 11.9%. VIII. CONCLUSION We proposed a novel MP multiplier architecture featuring, respectively, 28.2% and 15.8% reduction in silicon area and power consumption compared with its 32 × 32 bit conven-tional fixed-width multiplier counterpart.When integrating this MP multiplier architecture with an error-tolerant razor-based DVS approach and the proposed novel operands scheduler, 77.7%–86.3% total power reduction was achieved with a total silicon area overhead as low as 11.1%. The fabricated chip demonstrated run-time adaptation to the actual workload by operating at the minimum supply voltage level and mini-mum clock frequency while meeting throughput requirements. The proposed novel dedicated operand scheduler rearranges operations on input operands, hence to reduce the number of transitions of the supply voltage and, in turn, minimized the overall power consumption of the multiplier. The proposed MP razor-based DVS multiplier provided a solution toward achiev-ing full computational flexibility and low power consumption for various general purpose low-power applications. ACKNOWLEDGMENT The authors would like to thank Dr. M.K. 
Law for his comments and discussions.We also would like to acknowledge Mr. S.F. Luk for his help with the chip test measurements. REFERENCES [1] R. Min, M. Bhardwaj, S.-H. Cho, N. Ickes, E. Shih, A. Sinha, A. Wang, and A. Chandrakasan, “Energy-centric enabling technologies for wire-less sensor networks,” IEEE Wirel. Commun., vol. 9, no. 4, pp. 28–39, Aug. 2002. [2] M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhanc-ing power awareness of VLSI systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001. [3] A. Wang and A. Chandrakasan, “Energy-aware architectures for a real-valued FFT implementation,” in Proc. IEEE Int. Symp. Low Power Electron. Design, Aug. 2003, pp. 360–365. [4] T. Kuroda, “Low power CMOS digital design for multimedia proces-sors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367. [5] H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. IEEE Int. SOC Conf., Sep. 2004, pp. 123–126. [6] S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable booth multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 568–580, Mar. 2010. [7] O. A. Pfander, R. Hacker, and H.-J. Pfleiderer, “A multiplexer-based concept for reconfigurable multiplier arrays,” in Proc. Int. Conf. Field Program. Logic Appl., vol. 3203. Sep. 2004, pp. 938–942. [8] F. Carbognani, F. Buergin, N. Felber, H. Kaeslin, and W. Fichtner, “Transmission gates combined with level-restoring CMOS gates reduce glitches in low-power low-frequency multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 7, pp. 830–836, Jul. 2008. [9] T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by data-driven voltage variation,” in Proc. IEEE Int. Symp. Circuits Syst., May 2004, pp. 285–288. [10] W. Ling and Y. Savaria, “Variable-precision multiplier for equalizer with adaptive modulation,” in Proc. 47th Midwest Symp. Circuits Syst., vol. 1. Jul. 2004, pp. I-553–I-556. [11] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “A micropower low-voltage multiplier with reduced spurious switching,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 2, pp. 255–265, Feb. 2005. [12] M. Sjalander, M. Drazdziulis, P. Larsson-Edefors, and H. Eriks-son, “A low-leakage twin-precision multiplier using reconfigurable power gating,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 1654–1657. [13] S.-R. Kuang and J.-P. Wang, “Design of power-efficient pipelined truncated multipliers with various output precision,” IET Comput. Digital Tech., vol. 1, no. 2, pp. 129–136, Mar. 2007. [14] J. L. Holt and J.-N. Hwang, “Finite precision error analysis of neural network hardware implementations,” IEEE Trans. Comput., vol. 42, no. 3, pp. 281–290, Mar. 1993. [15] A. Bermak, D. Martinez, and J.-L. Noullet, “High-density 16/8/4-bit configurable multiplier,” Proc. Inst. Electr. Eng. Circuits Devices Syst., vol. 144, no. 5, pp. 272–276, Oct. 1997. [16] T. Kuroda, “Low power CMOS digital design for multimedia proces-sors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367. [17] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000. [18] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, “Variable supply-voltage scheme for low-power high-speed CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, no. 
Xiaoxiao Zhang (S’06) received the B.S. degree from the Department of Microelectronics, Tianjin University, Tianjin, China, and the M.E. degree from the Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China, in 2003 and 2006, respectively. She is currently pursuing the Ph.D. degree with the Electronic and Computer Engineering Department, Hong Kong University of Science and Technology, Hong Kong. Her Ph.D. research work involves the design of low-power real-time digital image processing (DIP) cores and modules for a camera-on-a-chip. Her current research interests include low-power and high-performance VLSI circuit design, signal processing architectures, face detection, and 3-D object/face recognition.
Farid Boussaid (M’00–SM’04) received the M.S. and Ph.D. degrees in microelectronics from the National Institute of Applied Science (INSA), Toulouse, France, in 1996 and 1999, respectively. He joined Edith Cowan University, Perth, Australia, as a Postdoctoral Research Fellow and a member of the Visual Information Processing Research Group in 2000. He joined the University of Western Australia, Crawley, Australia, in 2005, where he is currently an Associate Professor. His current research interests include smart CMOS vision sensors, gas sensors, neuromorphic systems, and device simulation, modeling, and characterization in deep submicron CMOS processes.

Amine Bermak (M’99–SM’04–F’13) received the M.Eng. and Ph.D. degrees in electronic engineering from Paul Sabatier University, Toulouse, France, in 1994 and 1998, respectively. He joined the Advanced Computer Architecture Research Group, York University, York, U.K., where he worked as a Post-Doctoral Fellow on the VLSI implementation of a CMM neural network for vision applications in a project funded by British Aerospace. He joined Edith Cowan University, Perth, Australia, in 1998, first as a Research Fellow working on smart vision sensors, then as a Lecturer and a Senior Lecturer. He is currently a Professor with the Electronic and Computer Engineering Department, Hong Kong University of Science and Technology (HKUST), Hong Kong. His current research interests include VLSI circuits and systems for signal and image processing, sensors, and microsystems applications. Dr. Bermak was a recipient of many distinguished awards, including the 2004 “IEEE Chester Sall Award,” the HKUST “Engineering School Teaching Excellence Award” in 2004 and 2009, and the “Best Paper Award” at the 2005 International Workshop on System-on-Chip for Real-Time Applications.