IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 759 
32 Bit×32 Bit Multiprecision Razor-Based Dynamic 
Voltage Scaling Multiplier With Operands Scheduler 
Xiaoxiao Zhang, Student Member, IEEE, Farid Boussaid, Senior Member, IEEE, 
and Amine Bermak, Fellow, IEEE 
Abstract—In this paper, we present a multiprecision (MP) reconfigurable multiplier that incorporates variable precision, parallel processing (PP), razor-based dynamic voltage scaling (DVS), and dedicated MP operand scheduling to provide optimum performance under a variety of operating conditions. All of the building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to perform higher-precision multiplications. Given the user's requirements (e.g., throughput), a dynamic voltage/frequency scaling management unit configures the multiplier to operate at the proper precision and frequency. Adapting to the run-time workload of the targeted application, razor flip-flops together with a dithering voltage unit then configure the multiplier to achieve the lowest power consumption. The single-switch dithering voltage unit and razor flip-flops reduce the voltage safety margins and the overhead typically associated with DVS to the lowest level. The large silicon area and power overhead typically associated with reconfigurability are removed. Finally, the proposed MP multiplier can further benefit from an operand scheduler that rearranges the input data so as to determine the optimum voltage and frequency operating conditions for minimum power consumption. This low-power MP multiplier is fabricated in AMIS 0.35-μm technology. Experimental results show that the proposed MP design features a 28.2% and 15.8% reduction in circuit area and power consumption, respectively, compared with a conventional fixed-width multiplier. When combining this MP design with error-tolerant razor-based DVS, PP, and the proposed operand scheduler, a 77.7%–86.3% total power reduction is achieved with a total silicon area overhead as low as 11.1%. This paper demonstrates that an MP architecture allows more aggressive frequency/supply voltage scaling for improved power efficiency.
Index Terms—Computer arithmetic, dynamic voltage scaling, 
low power design, multi-precision multiplier. 
Manuscript received June 8, 2012; revised February 11, 2013; accepted February 20, 2013. Date of publication April 18, 2013; date of current version March 18, 2014. This work was supported in part by a grant from the HK Research Grant Council under Grant 610509 and by the Australian Research Council's Discovery Projects Funding Scheme under Grant DP130104374.
X. Zhang and A. Bermak are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: zhangxx@ust.hk; eebermak@ust.hk).
F. Boussaid is with the School of Electrical, Electronic, and Computer Engineering, The University of Western Australia, Perth 6017, Australia (e-mail: farid.boussaid@uwa.edu.au).
Digital Object Identifier 10.1109/TVLSI.2013.2252032

I. INTRODUCTION
Consumer demand for increasingly portable yet high-performance multimedia and communication products imposes stringent constraints on the power consumption of individual internal components [1]–[4]. Of these, multipliers perform one of the most frequently encountered arithmetic operations in digital signal processors (DSPs) [4]. For embedded applications, it has become essential to design more power-aware multipliers [4]–[13]. Given their fairly complex
structure and interconnections, multipliers can exhibit a large 
number of unbalanced paths, resulting in substantial glitch 
generation and propagation [8], [11]. This spurious switching 
activity can be mitigated by balancing internal paths through a 
combination of architectural and transistor-level optimization 
techniques [8], [11]. In addition to equalizing internal path 
delays, dynamic power reduction can also be achieved by monitoring the effective dynamic range of the input operands so as to disable unused sections of the multiplier [6], [12] and/or truncate the output product at the cost of reduced precision [13]. This is possible because, in most sensor applications, the actual inputs do not always occupy their entire word-length. For example, in artificial neural network applications, the weight precision used during the learning phase is approximately twice that of the retrieval phase [14]. Moreover, operations at lower precisions are the most frequently required. In contrast, most of today's full-custom DSPs and application-specific integrated circuits (ASICs) are designed for a fixed maximum word-length so as to accommodate the worst case scenario. Therefore, an 8-bit multiplication computed on a 32-bit Booth multiplier would result in unnecessary switching activity and power loss.
Several works have investigated this word-length optimization. References [1] and [2] proposed an ensemble of multipliers of different precisions, with each optimized to cater for a particular scenario. Each pair of incoming operands is routed to the smallest multiplier that can compute the result, to take advantage of the lower energy consumption of the smaller circuit. This ensemble of point systems is reported to consume the least power, but this comes at the cost of increased chip area given the ensemble structure used. To address this issue, [3] and [5] proposed to share and reuse some functional modules within the ensemble. In [3], an 8-bit multiplier is reused for the 16-bit multiplication, adding scalability without a large area penalty. Reference [5] extended this method by implementing pipelining to further improve the multiplier's performance. A more flexible approach is proposed in [15], with several multiplier elements grouped together to provide higher precisions and reconfigurability. Reference [7] analyzed the overhead associated with such reconfigurable multipliers. This analysis showed that around 10%–20% of extra chip area is needed for 8–16 bit multipliers.
Combining multiprecision (MP) with dynamic voltage scaling (DVS) can provide a dramatic reduction in power consumption by adjusting the supply voltage according to the circuit's
run-time workload rather than fixing it to cater for the worst 
case scenario [4]. When adjusting the voltage, the actual 
performance of the multiplier running under scaled voltage 
has to be characterized to guarantee a fail-safe operation. 
Conventional DVS techniques consist mainly of lookup table 
(LUT) and on-chip critical path replica approaches [17]–[19]. 
The LUT approach tunes the supply voltage according to a 
predefined voltage-frequency relationship stored in a LUT, 
which is formed considering worst case conditions (process 
variations, power supply droops, temperature hot-spots, coupling noise, and many more). Therefore, large margins are necessarily added, which in turn significantly decrease the
effectiveness of the DVS technique. The critical path replica 
approach typically involves an on-chip critical path replica to 
approximate the actual critical path. Therefore, voltage could 
be scaled to the extent that the replica fails to meet the timing. 
However, safety margins are still needed to compensate for the 
intradie delay mismatch and address fast-changing transient 
effects [24]. In addition, the critical path may change as a result of varying supply voltage, process, or temperature variations. If this occurs, computations will completely fail regardless of the safety margins. The aforementioned
limitations of conventional DVS techniques motivated recent research efforts into error-tolerant DVS approaches [24]–[27], which can operate the circuit at run time even at a voltage level at which timing errors occur. A recovery mechanism is then applied to detect error occurrences and restore the correct data.
Because they completely remove worst case safety margins, error-tolerant DVS techniques can reduce power consumption more aggressively. In this paper, we propose a low-power reconfigurable multiplier architecture that combines MP with an error-tolerant DVS approach based on razor flip-flops [25]. The main contributions of this paper can be summarized as follows.
1) A novel MP multiplier architecture featuring, 
respectively, 28.2% and 15.8% reduction in silicon area 
and power consumption compared with its conventional 
32 × 32 bit fixed-width multiplier counterpart. All reported multipliers trade silicon area/power consumption for MP [7]. In this paper, silicon area is optimized by applying an operation reduction technique that replaces a multiplier with adders/subtractors.
2) A silicon implementation of this MP multiplier integrating an error-tolerant razor-based DVS approach. The fabricated chip demonstrates run-time
adaptation to the actual workload by operating at the 
minimum supply voltage level and minimum clock 
frequency while meeting throughput requirements. Prior 
works combining MP with DVS have only considered a limited number of offline simulated precision-voltage pairs, with unnecessarily large safety margins added to cater for critical paths [9], [10].
3) A novel dedicated operand scheduler that rearranges operations on input operands so as to reduce the number of supply voltage transitions and, in turn, minimize the overall power consumption of the multiplier.
Fig. 1. Overall multiplier system architecture. 
Unlike reported scheduling works, the function of the proposed scheduler is not task scheduling but rather input operand scheduling for the proposed MP multiplier.
The rest of this paper is organized as follows. Section II 
presents the operation and architecture of the proposed MP 
multiplier. Section III presents the approach used to reduce the 
overhead associated with MP and reconfigurability. Section IV
presents the operating principle and implementation of the 
DVS management unit. Section V presents the razor flip-flops, 
which are at the heart of the DVS flow. Section VI presents experimental results. Section VII presents the operands scheduler unit. Finally, a conclusion is given in Section VIII.
II. SYSTEM OVERVIEW AND OPERATION 
The proposed MP multiplier system (Fig. 1) comprises the following five modules:
1) the MP multiplier; 
2) the input operands scheduler (IOS), whose function is to reorder the input data stream into a buffer so as to reduce the required power supply voltage transitions;
3) the frequency scaling unit (FSU), implemented using a voltage controlled oscillator (VCO); its function is to generate the required operating frequency of the multiplier;
4) the voltage scaling unit (VSU), implemented using a voltage dithering technique to limit silicon area overhead; its function is to dynamically generate the supply voltage so as to minimize power consumption;
5) the dynamic voltage/frequency management unit 
(VFMU) that receives the user requirements (e.g., 
throughput). 
The VFMU sends control signals to the VSU and FSU 
to generate the required power supply voltage and clock 
frequency for the MP multiplier. 
The MP multiplier is responsible for all computations. 
It is equipped with razor flip-flops that can report timing errors associated with insufficiently high supply voltage levels.
Fig. 2. Possible configuration modes of proposed MP multiplier.
The operation principle is as follows. Initially, the multiplier 
operates at a standard supply voltage of 3.3 V. If the razor flip-flops 
of the multiplier do not report any errors, this means that 
the supply voltage can be reduced. This is achieved through the VFMU, which sends control signals to the VSU so as to lower the supply voltage level. When the feedback provided
by the razor flip-flops indicates timing errors, the scaling of 
the power supply is stopped. 
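For illustration, this control policy can be summarized by the minimal behavioral sketch below (Python). The step size, floor voltage, and the helper names read_razor_errors() and set_supply() are illustrative assumptions and not part of the fabricated VFMU.

# Simplified behavioral model of the razor-based voltage tuning loop
# described above. Step size, floor voltage, and helper names are
# illustrative assumptions, not the actual VFMU implementation.
V_NOMINAL = 3.3      # start-up supply voltage (V)
V_FLOOR = 0.8        # lowest supply verified on the chip (V)
V_STEP = 0.05        # assumed VFMU scaling granularity (V)

def tune_supply(read_razor_errors, set_supply):
    """Lower the supply until the razor flip-flops report a timing error,
    then stop scaling and restore the last error-free level."""
    v = V_NOMINAL
    set_supply(v)
    while v - V_STEP >= V_FLOOR:
        v -= V_STEP
        set_supply(v)
        if read_razor_errors():   # error feedback from the razor flip-flops
            v += V_STEP           # back off to the last error-free level
            set_supply(v)
            break
    return v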
The proposed multiplier (Fig. 2) not only combines MP 
and DVS but also parallel processing (PP). Our multiplier 
comprises nine 8 × 8 bit reconfigurable multipliers. These building blocks can either work as nine independent multipliers or work in parallel to perform one, two, or three 16 × 16 bit multiplications or a single 32 × 32 bit operation. PP can be
used to increase the throughput or reduce the supply voltage 
level for low power operation. 
Fig. 3 shows the benefits of the different approaches being 
considered. Power consumption is a linear function of the workload, which is normally represented by the input operand precision. Curve 1 corresponds to the case of a fixed-precision
(FP) multiplier using a fixed power supply. Region 1 shows 
the power optimization space for MP techniques, which use 
different-precision multiplications to reduce power. If one 
combines MP with DVS, power is further reduced with 
curves (1)–(3) becoming curves (4)–(6), respectively. Regions 
1 and 2 show the power optimization space for the combined 
approach. With PP, the operating frequency can be decreased together with the supply voltage, as shown in curves (7) and (8). Finally, region 3 shows the optimization space for the proposed approach, which combines MP and DVS with PP.
III. MP AND RECONFIGURABILITY OVERHEAD 
Fig. 4 shows the structure of the input interface unit, 
which is a submodule of the MP multiplier (Fig. 1). The 
role of this input interface unit (Fig. 4) is to distribute the 
input data between the nine independent processing elements 
(PEs) (Fig. 2) of the 32 × 32 bit MP multiplier, considering 
the selected operation mode. The input interface unit uses 
an extra MSB sign bit to enable both signed and unsigned multiplications.
Fig. 3. Conceptual view of optimization spaces of MP, DVS, and PP approaches.
A 3-bit control bus indicates whether the
inputs are 1/4/9 pair(s) of 8-bit operands, or 1/2/3 pair(s) of 
16-bit operands, or 1 pair of 32-bit operands, respectively. 
Depending on the selected operating mode, the input data 
stream is distributed (Fig. 4) between the PEs to perform 
the computation. Fig. 5 shows how three 8 × 8 bit PEs are 
used to realize a 16 × 16 bit multiplier. The 32 × 32 bit 
multiplier is constructed using a similar approach but requires 
3 × 3 PEs. A 3-bit control word defines which PEs work 
concurrently and which PEs are disabled. Whenever the full 
precision (32 × 32 bit) is not exercised, the supply voltage 
and the clock frequency may be scaled down according to the 
actual workload. 
To evaluate the overhead associated with reconfigurability and MP, we define X and Y as the 2n-bit wide multiplicand and multiplier, respectively. XH and YH are their respective n most significant bits, whereas XL and YL are their respective n least significant bits. XL YL, XH YL, XL YH, and XH YH are the crosswise products. The product of X and Y can be expressed as follows:

P = XH YH · 2^(2n) + (XH YL + XL YH) · 2^n + XL YL    (1)

where a 2n-bit reconfigurable multiplier can be built using adders and four n bit × n bit multipliers to compute XH YH, XH YL, XL YH, and XL YL. Table I shows that this would result in overheads of 18% and 13% for the silicon area and power, respectively. However, if we define [18]

X' = XH + XL    (2)
Y' = YH + YL    (3)

then (1) can be rewritten as follows:

P = XH YH · 2^(2n) + (X'Y' − XH YH − XL YL) · 2^n + XL YL.    (4)
Comparing (1) and (4), we have removed one n × n bit multiplier (for calculating XH YL or XL YH) and one 2n-bit adder (for calculating XH YL + XL YH).
Fig. 4. Structure of input interface unit. 
Fig. 5. Three PEs combined to form 16 × 16 bit multiplier. 
The two adders are replaced with two n-bit adders (for calculating XH + XL and YH + YL) and two (2n + 2)-bit subtractors (for calculating X'Y' − XH YH − XL YL). In a 32-bit multiplier, we can thus significantly reduce the design complexity by using two 34-bit subtractors to replace a 16 × 16 bit multiplier. We actually need two 16 × 16 bit multipliers (for calculating XH YH and XL YL) and one 17 × 17 bit multiplier (for calculating X'Y').
To evaluate the proposed MP architecture, a conventional 32-bit fixed-width multiplier and four sub-block MP multipliers are designed using a Booth radix-4 Wallace tree structure similar to that used for the building blocks of our MP three sub-block multiplier. These multipliers are synthesized using the Synopsys Design Compiler with an AMIS 0.35-μm complementary metal-oxide-semiconductor (CMOS) standard cell technology library. The power simulations are performed at a clock frequency of 50 MHz and at a power supply of 3.3 V.
Table I shows the implementation results including silicon area 
and power consumption for these multipliers. The proposed 
MP three sub-block architecture can achieve reductions of 
about 16% in power and 28% in area as compared with the 
conventional 32 × 32 bit fixed-width multiplier design.
TABLE I
AREA AND POWER COMPARISON OF PROPOSED MP MULTIPLIERS AGAINST CONVENTIONAL FIXED-WIDTH MULTIPLIER RUNNING AT 50 MHz

Scheme                              Power (mW)      Area (mm2)
32-bit fixed-width multiplier       39.62 (100%)    0.624 (100%)
32-bit 4 sub-block MP multiplier    44.76 (113%)    0.736 (118%)
32-bit 3 sub-block MP multiplier    33.36 (84%)     0.448 (72%)
The latter uses a Booth radix-4 Wallace tree structure similar to that
used in designing the building blocks of our MP multipliers. 
However, because of its larger size, the 32 × 32 bit fixed-width 
multiplier exhibits an irregular layout with complex 
interconnects. This limitation of tree multipliers happens to be 
addressed by our MP 32 × 32 bit multiplier, which uses a more 
regular design to partition, regroup, and sum partial products. 
IV. DYNAMIC VOLTAGE AND FREQUENCY 
SCALING MANAGEMENT 
A. DVS Unit 
In our implementation (Fig. 1), a dynamic power supply and 
a VCO are employed to achieve real-time dynamic voltage and 
frequency scaling under various operating conditions. In [28], it is shown that near-optimal dynamic voltage scaling can be achieved using voltage dithering, which exhibits a faster response time than conventional voltage regulators. Voltage dithering uses power switches to connect different supply voltages to the load, depending on the time slots. An intermediate average voltage is thereby achieved. This conventional voltage dithering technique has some limitations. If the power switches
are toggled with overlapping periods, switches can be turned 
on simultaneously, giving rise to a large transient current. 
To mitigate this, nonoverlapping clocks could be used to 
control power switches. However, this may result in system 
instability as there are instances where all supply voltages 
are disconnected from the load. The requirement for multiple 
supplies can also result in system overhead. To address these 
issues, we implemented a single-supply voltage dithering
Fig. 6. (a) Proposed single-header voltage dithering unit and voltage and 
frequency tuning loops. (b) Experimental timing results from voltage dithering 
unit. 
scheme [Fig. 6(a)], which operates as follows. When the supply voltage (Vn) of the multiplier drops below the predefined
reference voltage (Vref), the comparator output (Va) toggles. 
Therefore, the VFMU turns on the power switch via Vctrl, 
for a predefined duration Tc = 5 μs. The chosen value for the 
off-chip storage capacitor Cs is 4.7 μF. This value is chosen to 
achieve a voltage ripple magnitude of 50 mV [Fig. 6(b)] with 
a charging current set to 50 mA, so as to limit the resistive power loss of the dithering unit to less than 1% of the total
power consumption. The value of Cs is a tradeoff between 
ripple magnitude, tracking speed, and area/power overheads. 
Fig. 6(b) shows experimental results for the voltage control 
loop. 
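A rough behavioral model of this single-switch dithering loop is sketched below. Tc, Cs, and the 50 mA charge current follow the values quoted above; the time step and the constant load current are assumptions made only for illustration.

# Behavioral sketch of the single-switch voltage dithering loop of Fig. 6(a).
# The load current and time step are illustrative assumptions.
T_STEP = 0.1e-6     # simulation time step (s), assumed
T_CHARGE = 5e-6     # switch on-time Tc (s)
C_S = 4.7e-6        # off-chip storage capacitor Cs (F)
I_CHARGE = 50e-3    # charging current when the switch is on (A)
I_LOAD = 10e-3      # multiplier load current (A), assumed constant

def dither(v_ref, v_start, duration):
    v, t, t_on_left, trace = v_start, 0.0, 0.0, []
    while t < duration:
        if t_on_left <= 0 and v < v_ref:   # comparator output Va toggles
            t_on_left = T_CHARGE           # VFMU closes the switch via Vctrl for Tc
        i = (I_CHARGE - I_LOAD) if t_on_left > 0 else -I_LOAD
        v += i * T_STEP / C_S              # dV = I * dt / Cs
        t_on_left -= T_STEP
        t += T_STEP
        trace.append(v)
    return trace                           # ripple is roughly I*Tc/Cs, about 40 mV here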
B. Dynamic Frequency Scaling Unit 
In the proposed 32 × 32 bit MP multiplier, dynamic 
frequency tuning is used to meet throughput requirements. 
It is based on a VCO implemented as a seven-stage current 
starved ring oscillator. The VCO output frequency can be 
tuned from 5 to 50 MHz using four control bits (5 MHz/step). 
This frequency range is selected to meet the requirements 
of general purpose DSP applications. The reported multiplier 
can operate as a 32-bit multiplier or as nine independent 
8-bit multipliers. For the chosen 5–50 MHz operating range, 
our multiplier boasts up to 9 × 50 = 450 MIPS. The 
simulated power consumption for the VCO ranges from 
Fig. 7. Experimental measurement of worst case frequency switching 
(from 50 to 5 MHz). 
Fig. 8. Conceptual view of razor flip-flop [25]. 
85 μW (at 5 MHz) to 149 μW (at 50 MHz), which is negligible compared with the power consumed by the multiplier. Fig. 7 shows
experimental measurements showing the transient response for 
the worst case frequency switching (from 50 to 5 MHz). Clock 
frequency can settle within one clock cycle as required. 
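The frequency-scaling interface can be summarized by the small helper below. The control-word encoding is an assumption (only the 5 MHz/step, 5–50 MHz range is given above); the peak-MIPS figure simply restates the 9 × 50 calculation.

# Illustrative FSU/VCO control mapping and peak throughput, assuming a
# linear encoding of the 4-bit control word (only ten codes are used).
def vco_frequency_mhz(ctrl_word):
    assert 0 <= ctrl_word <= 9, "codes 0-9 cover 5-50 MHz in 5 MHz steps"
    return 5 + 5 * ctrl_word          # 0 -> 5 MHz, ..., 9 -> 50 MHz

def peak_mips(freq_mhz, mode):
    parallel_mults = {"8-bit": 9, "16-bit": 3, "32-bit": 1}[mode]
    return parallel_mults * freq_mhz  # e.g., 9 x 50 = 450 MIPS in 8-bit mode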
V. IMPLEMENTATION OF RAZOR FLIP-FLOPS 
Although the worst case paths are very rarely exercised, traditional DVS approaches still maintain relatively large safety margins to ensure reliable circuit operation, resulting in excessive power dissipation. The razor technique is a breakthrough that largely eliminates these safety margins by providing variation tolerance through in-situ timing error detection and correction [25]. This approach is based on a razor
flip-flop, which detects and corrects delay errors by double 
sampling. The razor flip-flop (Fig. 8) operates as a standard positive edge-triggered flip-flop coupled with a shadow latch, which samples on the negative edge. The input data is therefore given the duration of the positive clock phase to settle to its correct state before being sampled by the shadow latch. The minimum allowable supply voltage needs to be set such that the shadow latch (Fig. 8) always clocks the correct data even under worst case conditions. This requirement is
usually satisfied given that the shadow latch is clocked later 
than the main flip-flop. A comparator flags a timing error 
when it detects a discrepancy between the speculative data 
sampled at the main flip-flop and the correct data sampled
Fig. 9. Microphotograph of 32 × 32 bit MP multiplier. 
at the shadow latch. The correct data would subsequently 
overwrite the incorrect signal. The key idea behind razor flip-flops 
is that if an error is detected at a given pipeline stage X, 
then computations are only re-executed through the following 
pipeline stage X + 1. This is possible because the correct 
sampled value would be held by the shadow latch [25]. This 
approach ensures forward progress of data through the entire 
pipeline at the cost of a single-clock cycle [25]. 
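The double-sampling mechanism can be captured by a minimal behavioral model, sketched below under obvious simplifications (single bit of data, no metastability): the main flip-flop samples on the rising edge, the shadow latch samples the settled value on the falling edge, and any mismatch raises the error flag and restores the shadow value.

# Behavioral, single-bit model of a razor flip-flop's double sampling.
class RazorFF:
    def __init__(self):
        self.main = 0      # speculative value, clocked on the rising edge
        self.shadow = 0    # late, assumed-correct value from the shadow latch

    def rising_edge(self, d):
        self.main = d
        return self.main

    def falling_edge(self, d_settled):
        self.shadow = d_settled
        error = self.shadow != self.main
        if error:                      # mismatch: restore the correct value;
            self.main = self.shadow    # the pipeline stalls for one cycle
        return error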
An error correction mechanism, based on global clock 
gating, is implemented in the proposed multiplier [25]. In this 
correction scheme, error and clock signals are used to determine when the entire pipeline needs to be stalled for a single clock cycle. Fig. 1 shows that a global error signal is fed
to the VFMU so as to alert the controlling unit whenever 
the current operating voltage is lower than necessary. The 
VFMU will then increase the voltage reference. This will in 
turn result in the VSU generating a new supply voltage level 
based on the new target voltage reference. When an error 
occurs, results can be recomputed at any pipeline stage using 
the corresponding input of the shadow latch. Therefore, the 
correct values can be forwarded to the corresponding next 
stages. Given that all stages can carry out these recomputations 
in parallel, the adopted global clock gating can tolerate any 
number of errors within a given clock cycle [25]. After one 
clock cycle, normal pipeline operation can resume. The actual 
implementation of razor flip-flops requires careful design to 
meet timing constraints and avoid system failure. For example, 
the use of a delayed clock for the shadow latch (Fig. 8) makes 
it possible for a short-path in the combinational logic to corrupt 
the data in the shadow latch [25]. This imposes a short-path 
delay constraint at the input of each razor flip-flop of our 
multiplier. To meet these constraints across all corners, we 
inserted delay buffers through all short paths found by Cadence system-on-chip (SOC) Encounter and validated them through PrimeTime. In addition, precautions are taken to mitigate
metastability by inserting a metastability detector at the output 
of each main flip-flop. The outputs of the metastability detector 
and the error comparator (Fig. 8) are ORed to generate the error signal of individual razor flip-flops [25], [26]. These razor error signals are ORed together to form a global
error signal used to ensure that all valid data in the shadow 
latches is restored into the main flip-flops before the next clock cycle. The adopted design for the metastability detector is that proposed in [26]. This metastability detector relies on skewed inverters, which require careful simulation through all process corners to ensure proper operation [26].

TABLE II
PROTOTYPE CHARACTERISTICS

Technology node:                            0.35 μm
Die size:                                   1.5 × 1.0 mm
Total number of transistors:                37656
Measured chip power at 3.3 V:               39 mW
DVS supply voltage range:                   0–3.3 V
DFS clock frequency range:                  5–50 MHz
Total number of flip-flops:                 144
Number of razor flip-flops:                 13
Standard D flip-flop power:                 57 μW
Razor flip-flop power (static/switching):   70/239 μW
Total power overhead of razor flip-flops:   2.3%
When implementing razor-based DVS, it is essential that the 
resulting power/delay overhead be kept to a minimum, so as not to severely limit the benefits brought by aggressive supply voltage scaling. In the case of our multiplier, only 13 out of a total of 144 flip-flops, that is, 9% of the flip-flops, are found not to meet timing constraints under the worst case supply voltage level (Table II). Therefore, only these 13 critical paths are
equipped with razor flip-flops. These 13 near-critical paths 
are identified through Cadence SOC Encounter and validated 
using PrimeTime. At a supply voltage of 3.3 V and operating
frequency of 50 MHz, the razor flip-flop is found to consume 1.2 times the power of a standard flip-flop (70 versus 57 μW) when no timing errors are detected, and 4.2 times the standard flip-flop power (239 versus 57 μW) when errors occur. However, for
a conservative activity factor of 1%, the power overhead due 
to razor flip-flops was estimated to be less than 2.3% of the 
nominal chip power because only 9% of the flip-flops were 
made razor flip-flops. Therefore, both the silicon area and 
power overheads associated with razor flip-flops are found to
be negligible. The razor flip-flop's delay overhead is mainly due to the additional multiplexer at its input as well as the increased fan-out resulting from the introduction of the comparator, metastability detector, and OR gates at the output.
At a supply voltage of 3.3 V and operating frequency of 
50 MHz, delay overheads are found to be 1.20% and 3.58% for 
error-free and error-occurring cases, respectively. These delay 
overheads constitute a small penalty for the massive power 
reduction enabled by razor-based DVS. 
VI. PERFORMANCE EVALUATION AND DISCUSSION 
We designed and fabricated a 32 × 32 bit reconfigurable 
multiplier in AMIS 0.35-μm technology. The die photograph 
of the multiplier system is shown in Fig. 9 and the chip 
characteristics are shown in Table II. The operating mode
Fig. 10. Experimental results of minimum voltage supply for different 
precisions and operating frequencies. 
of the multiplier is controlled by three external signals. 
The operating voltage and frequency are tuned automatically 
depending on the actual workload of the multiplier. The chip 
is tested by feeding in randomly generated operands and 
comparing the outputs with results from a PC processing the 
same data. The 32-bit precision data sets include data with 
an effective word-length of 17–32 bits. The 16-bit precision 
data sets and 8-bit precision data sets include data with an 
effective word-length of 9–16 bits and 0–8 bits, respectively. 
We achieved full functionality across a voltage range of 
0.8–3.3 V, and a frequency range of 5–50 MHz. Fig. 10 
shows the relation between the minimum supply voltage 
and operating frequency for different precision modes. As 
explained, razor energy savings are a result of the elimination 
of safety margins and processing below the first failure voltage. 
By scaling the voltage below the first failure point, an error 
rate of 0.1% is maintained and the power consumption is 
measured at this minimum possible voltage. For an operating 
frequency of 50 MHz, the supply voltage is set to 2.45, 1.95, 
and 1.80 V for the 32, 16, and 8-bit modes, respectively. 
For lower operating frequencies, the required supply voltage 
levels are much lower, as shown in Fig. 10. The chip power 
consumption for different operating modes is shown in Fig. 11. 
For 16-bit operands, 55.6% (17.35 versus 39.04 mW) power 
reduction can be obtained by the MP scheme. When the 
DVS technique is applied, the chip consumes 6.06 mW at 
the first failure point at an optimal 0.1% error rate, leading 
to a further 65.1% (6.06 versus 17.35 mW) power saving. 
With the PP feature enabled, the operating frequency can be scaled down to one third of the original, and the voltage can therefore be tuned to a much lower level for an additional 46.7% (3.23 versus 6.06 mW) power reduction. For 8-bit
operands, the MP, DVS, and PP schemes can help save 87.4% 
(4.90 versus 39.04 mW), 70.2% (1.46 versus 4.90 mW), and 
55.5% (0.65 versus 1.46 mW) power, respectively. 
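As a quick sanity check, the percentages quoted above follow directly from the measured figures; the snippet below reproduces the 16-bit-mode numbers.

# Reproducing the quoted 16-bit-mode savings from the measured powers (mW).
p_32_fixed = 39.04   # 32-bit fixed mode, no MP/DVS/PP
p_16_mp = 17.35      # 16-bit MP mode
p_16_dvs = 6.06      # 16-bit MP + razor DVS
p_16_pp = 3.23       # 16-bit MP + DVS + PP

saving = lambda new, old: round(100 * (1 - new / old), 1)
print(saving(p_16_mp, p_32_fixed))   # 55.6 % from MP alone
print(saving(p_16_dvs, p_16_mp))     # 65.1 % from adding razor-based DVS
print(saving(p_16_pp, p_16_dvs))     # 46.7 % from adding PP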
Fig. 12 shows experimental results for the power savings associated with the MP, razor-based DVS, and PP features of the fabricated 32 × 32 bit multiplier.
Fig. 11. Experimental results of power consumption of different operating 
schemes. 
Fig. 12. Experimental data showing power optimization spaces associated 
to MP, razor-based DVS, and PP schemes. 
Region 1 is the power
optimization space for MP whereas Regions 2(a) and (b) are 
the power optimization spaces for the DVS technique without 
and with razor, respectively. Finally, region 3 is the power 
optimization space for PP. Fig. 12 shows that when MP is combined with DVS, power consumption is reduced to 29.12, 8.07, and 1.98 mW (labeled points in Fig. 12) for 32-, 16-, and 8-bit multiplications, respectively. In addition, razor flip-flops help reduce the operating voltage to the minimum possible level, resulting in a further power reduction of 26.1% (from 29.12 to 21.52 mW), 24.9% (from 8.07 to 6.06 mW), and 26.3% (from 1.98 to 1.46 mW) for 32-, 16-, and 8-bit precision, respectively.
Based on PP, the power reduction space is further enlarged. 
Table III compares the performance of the fabricated prototype 
with related works. References [7], [20], and [21] correspond to FP, fixed-voltage schemes, whereas [5], [22], and [23] are MP, fixed-voltage schemes, and [9] and [10] are MP, multivoltage schemes. To compare the silicon area associated with each scheme, we chose to use the number of
transistors because it constitutes a fair metric to compare different CMOS technology nodes. From Table III, the proposed multiplier provides the most reconfigurability while exhibiting the smallest relative area. Compared with the designs
with the same maximum word-length of 32 bits [21], [23], our design boasts a much smaller area. The maximum word-length of designs [7] and [5] is 16 bits instead of 32 bits. If we assume that a 32 × 32 bit multiplier is built using three 16 × 16 bit multipliers, then the area of the 32-bit
using three 16 × 16 bit multipliers, then the area of the 32-bit 
multiplier is at least three times that of the 16-bit multiplier, 
discarding the glue and reconfigurability logic. This shows that 
the proposed multiplier outperforms reported implementations 
whether considering silicon area or reconfigurability. In regard 
to power dissipation, Table III shows normalized power results (using P = C V^2 f α0→1) to cater for the different technologies previously reported. As in previous works, we consider
random input test patterns, with activity factors determined 
using models describing the propagation of the input statistics 
to the output of data-path operators [29]. Normalized power 
results show that the proposed multiplier outperforms reported 
implementations in terms of power dissipation. In previous 
works, flexibility and reconfigurability have come at a cost of 
increased silicon area and power consumption. In this paper, we propose an implementation that not only provides an MP reconfigurable datapath, but also achieves a reduction in both silicon area and power, as compared with FP multipliers. In more advanced deep submicrometer processes, the proposed MP multiplier with razor DVS offers the ability to compensate for process variations. It would also be essential to integrate leakage reduction techniques [30], so as to jointly minimize leakage and dynamic power consumption.
VII. INPUT OPERANDS SCHEDULER 
A. Motivation and Operation Principle 
In the previous section, we reported experimental results obtained using different data sets, each composed of randomly
generated single-precision operands. However, in some 
applications such as artificial neural network applications, the 
input data stream could include mixed-precision operands [1]. 
Although our multiplier provides three different precision 
modes (32 × 32 bit, 16 × 16 bit, 8 × 8 bit), the supply 
voltage would still have to transit dynamically between the 
minimum required voltage levels Vmin32, Vmin16, or Vmin8 
required for 32, 16, and 8-bit operands, respectively. Fig. 10 
shows that given a certain operating frequency, the difference among Vmin32, Vmin16, and Vmin8 can be in the range of 0.1–0.65 V. If the input data stream requires frequent
supply voltage transitions, significant dynamic power would 
be dissipated, thereby undermining the benefits of DVS. 
In addition, these transitions may not always be possible within 
one clock cycle. To minimize the overall power consumption, 
one needs to reduce the number of supply voltage transitions 
while still processing operands at the minimum required 
voltage level. To address this problem, we propose an IOS 
that will perform the following tasks: 1) reorder the input 
data stream such that same-precision operands are grouped 
together into a buffer (Fig. 13) and 2) find the minimum supply 
voltages (Vmin32, Vmin16, Vmin8), and operating frequencies 
( f32, f16, f8) for the three different-precision data groups to 
minimize the overall power consumption while still meeting 
the specified throughput. 
The block diagram of the IOS is shown in Fig. 13. It is 
composed of an operand range detector, a pattern generation engine, a 2-kbit buffer (RAM), and a frequency/voltage analyzer. The scheduler operates as follows. The input operands
are first sent to the range detector, which classifies them 
according to their precision: 32, 16, or 8-bit. The classified 
data is then grouped by the pattern generation engine, which 
packs same-precision data into three different 32-bit data 
patterns (Fig. 13): 1) pattern 1 corresponds to an original 32-bit input operand; 2) pattern 2 combines two 16-bit operands (with their redundant 16 MSBs removed); and 3) pattern 3 combines four 8-bit operands (with their redundant 24 MSBs removed). At each clock cycle, a 32-bit data pattern
can be processed, owing to the PP capability of the proposed multiplier. This resembles a SIMD structure and puts the MP and PP capabilities to full use. As shown in Fig. 13, the
three different data patterns are counted (N32, N16, and N8) 
and stored into a buffer, together with the respective voltages 
and clock frequencies at which they should be processed. For 
each full buffer, only two supply voltage transitions are needed: (Vmin8, f8)–(Vmin16, f16) and (Vmin16, f16)–(Vmin32, f32). To limit the silicon area overhead, we chose a 2-kbit RAM, which can store 60 32-bit data patterns. The voltage/frequency
analyzer specifies the values of Vmin32, Vmin16, Vmin8, f32, f16, 
and f8 to the dithering unit and VCO. The Vmin– f pairs are 
determined during the characterization of the chip and stored 
in the LUT (Fig. 13). 
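The front end of the IOS can be sketched in a few lines, as shown below. The unsigned-magnitude range detection and the simple ceiling-division packing are simplifying assumptions, not the hardware implementation.

# Sketch of the IOS front end: classify operand pairs by effective
# word-length and count the resulting 32-bit data patterns.
def effective_precision(x, y):
    """Return 8, 16, or 32 for an (unsigned) operand pair."""
    m = max(x.bit_length(), y.bit_length())
    return 8 if m <= 8 else 16 if m <= 16 else 32

def pack_patterns(operand_pairs):
    groups = {8: [], 16: [], 32: []}
    for x, y in operand_pairs:
        groups[effective_precision(x, y)].append((x, y))
    n32 = len(groups[32])               # pattern 1: one 32-bit operand per word
    n16 = -(-len(groups[16]) // 2)      # pattern 2: two 16-bit operands per word
    n8 = -(-len(groups[8]) // 4)        # pattern 3: four 8-bit operands per word
    return groups, (n32, n16, n8)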
B. Problem Formulation 
Given a random mixed-precision (32-, 16-, or 8-bit) input 
data stream and specified throughput Tp, our goal is to 
determine the voltages (Vmin32, Vmin16, Vmin8) and frequencies 
( f32, f16 and f8) at which each precision data group should be 
processed such that the total power consumption is minimized. 
In the following analysis, we consider the following four 
components of the total power consumption: 1) the resistive power loss Pdith_resistic_loss of the dithering unit; 2) the switching power loss Pdith_switching_loss of the dithering unit; 3) the dynamic power consumption Pcomputation associated with the multiplication computation; and 4) Pcompu_overhead, which corresponds to the power consumed when the computation is carried out at voltage levels higher than the nominal Vmin. The equations of these four components are given below:
Pdith_resistic_loss = Ichar^2 Ron    (5)

where Ichar is the charge current of the dithering unit and Ron is the equivalent resistance of the dithering switch;

Pdith_switching_loss = Cg Vdd^2 f / N    (6)

where Cg is the gate capacitance of the dithering switch, Vdd is the 3.3 V standard voltage, and N is the number of input data patterns.
TABLE III 
PERFORMANCE COMPARISON OF PROPOSED MULTIPLIERWITH RELATED WORKS 
Fig. 13. Block diagram of IOS. 
Pcomputation = Cm Vmin^2 f    (7)

where Cm is the effective capacitance of the multiplier, Vmin is the applied minimum supply voltage, and f is the applied operating frequency;

Pcompu_overhead = (∫ P dt) / T = (∫ Cm V^2 f dV) / T    (8)

where V is the dithering unit output voltage, which fluctuates around Vmin, and T is the charge time period, which is inversely proportional to the operating frequency.
The overall power consumption is thus given by

Poverall = Pcomputation + Pcompu_overhead + Pdith_resistic_loss + Pdith_switching_loss.    (9)
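A simplified evaluation of (5)–(9) is sketched below. The overhead term (8) is approximated here by its time average, i.e., Cm f (⟨V^2⟩ − Vmin^2); all argument values are placeholders to be taken from the chip characterization, not measured constants.

# Sketch of the power budget (5)-(9), with (8) approximated by a time
# average of the dithered supply. All inputs are example placeholders.
def overall_power(i_char, r_on, c_g, v_dd, n_patterns, c_m, v_min, f, v_sq_avg):
    p_resistive = i_char ** 2 * r_on                  # (5)
    p_switching = c_g * v_dd ** 2 * f / n_patterns    # (6)
    p_compute = c_m * v_min ** 2 * f                  # (7)
    p_overhead = c_m * f * (v_sq_avg - v_min ** 2)    # (8), time-averaged approximation
    return p_compute + p_overhead + p_resistive + p_switching   # (9)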
In the following, we present three different algorithms 
to reduce this overall power consumption. Each of these 
algorithms constitutes a different approach to process the 
mixed-precision data held in the operand buffer (Fig. 13). The performance of each algorithm is evaluated using a mixed-precision data set of 120 000 randomly generated operands, with a third corresponding to each precision (8-, 16-, and 32-bit).
Fig. 14. Operation principles of operand scheduling algorithms A, B, and C. Data Block X and Data Block X+1 refer to two consecutive operand data blocks subsequently stored into the RAM.
TABLE IV
DETAILED POWER PERFORMANCE OF DIFFERENT SCHEDULING ALGORITHMS

Algorithm   P_computation   P_compu_overhead   P_dith_resistic_loss   P_dith_switching_loss   P_overall
A           3.034 mW        3.159 mW           0.059 mW               1.715 mW                8.255 mW
B           2.266 mW        2.565 mW           0.084 mW               1.663 mW                6.578 mW
C           1.682 mW        1.843 mW           0.062 mW               0.975 mW                4.561 mW
In the following, the specified throughput Tp for the proposed 32 × 32 bit multiplier is 64 F (Mbits/s), where F is the multiplier's operating frequency.
C. Algorithm A 
In the first algorithm, the multiplier throughput Tp = 64 F is kept constant by fixing the operating frequencies (f32, f16, or f8) of each precision data group (32-, 16-, or 8-bit) to

f32 = F,  f16 = F/2,  f8 = F/4    (10)
where F is the multiplier’s operating frequency. This is 
because the throughput in 8 × 8 bit multiplication mode is four 
times that of the 32 × 32 bit multiplication mode and double 
that of the 16 × 16 bit multiplication mode, as a result of the 
multiplier PP. The minimum supply voltage (Vmin32, Vmin16 
or Vmin8) associated with each operating frequency (f32, f16
or f8) is determined through a Vmin– f LUT. Algorithm A 
shows its limitations when 32-bit operands are processed 
initially. As shown in Fig. 14, once all N32 operands of the data 
block are processed, the supply voltage (Vn) needs to decrease 
rapidly from point A (Vmin32) to point B (Vmin16) at which all 
N16 16-bit operands of the data block should be processed. 
If N16 is too small, most 16-bit operands will actually be processed in segment A–B, that is, at a voltage possibly much higher than the minimal Vmin16 level. Similarly, 8-bit operands of the data block could be processed in segments C–D, B–C, or even A–B in the worst case. This contributes to increasing Pcompu_overhead. The overall power performance of
algorithm A is shown in Table IV. Compared with the fixed-width 
32 × 32 bit standard multiplier (32 × 32 bit mode 
must be chosen given that a third of operands are 32-bit), 
77.7% total power reduction is achieved with a total silicon 
area overhead of only 11.1%, when considering DVS, razor, 
RAM, and dedicated circuitry for scheduling algorithm A. 
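Algorithm A reduces to a table lookup once (10) is fixed, as the sketch below illustrates. Only the 50 MHz voltage points quoted in Section VI are measured; the other Vmin–f entries are illustrative placeholders for the on-chip LUT.

# Sketch of algorithm A: fixed per-precision frequencies per (10) plus a
# Vmin-f lookup. LUT entries below 50 MHz are illustrative placeholders.
VMIN_F_LUT = {
    32: {50: 2.45, 25: 1.60, 12.5: 1.20},
    16: {50: 1.95, 25: 1.40, 12.5: 1.10},
    8:  {50: 1.80, 25: 1.30, 12.5: 1.00},
}

def algorithm_a(F=50):
    freqs = {32: F, 16: F / 2, 8: F / 4}   # keeps throughput Tp = 64 F constant
    volts = {p: VMIN_F_LUT[p][freqs[p]] for p in (32, 16, 8)}
    return freqs, volts                    # e.g., F = 50 MHz -> 2.45, 1.40, 1.00 V (placeholders)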
D. Algorithm B 
This algorithm removes all transitions of the power supply 
voltage by making Vmin32, Vmin16, and Vmin8 equal and adjusting f32, f16, and f8 such that the overall throughput is kept unchanged. We thus need to have the following:
(64 N32 + 128 N16 + 256 N8) / (N32/f32 + N16/f16 + N8/f8) = 64 F.    (11)
From a LUT, we can obtain the Vmin– f relationship as 
follows: 
Vmin32 = ψ32(f32)    (12)
Vmin16 = ψ16(f16)    (13)
Vmin8 = ψ8(f8).    (14)

As algorithm B keeps the supply voltage constant,

ψ32(f32) = ψ16(f16) = ψ8(f8) = V    (15)
the operating frequencies f32, f16, and f8 can be determined 
by using (11) and (15). For example, when F is set to 
50 MHz, the values for V , f32, f16, and f8 are found to 
be 1.35 V, 20 MHz, 25 MHz, and 35 MHz, respectively. 
The overall power consumption of algorithm B is shown in 
Table IV. Due to the complete removal of voltage transitions, Pcompu_overhead is reduced. At the same time, because the frequencies are planned holistically, the dynamic computation power is also optimized to a lower level. Compared with the fixed-width
32 × 32 bit standard multiplier, 81.5% power reduction is 
achieved with a total silicon area overhead of only 11.9%, 
when considering DVS, razor, RAM, and dedicated circuitry 
for scheduling Algorithm B. 
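Algorithm B amounts to searching for the lowest single voltage whose per-precision maximum frequencies still satisfy (11). A sketch is given below; psi_inv stands for the inverse of the ψ curves in (12)–(14), i.e., the maximum frequency each precision sustains at a given supply, taken from the chip LUT. The voltage grid and the curves themselves are assumptions of this sketch.

# Sketch of algorithm B: one shared supply voltage, frequencies from the
# inverse Vmin-f curves, checked against the throughput constraint (11).
def throughput_ok(f32, f16, f8, n32, n16, n8, F):
    bits = 64 * n32 + 128 * n16 + 256 * n8
    time = n32 / f32 + n16 / f16 + n8 / f8
    return bits / time >= 64 * F                     # constraint (11)

def algorithm_b(psi_inv, n32, n16, n8, F, v_grid):
    """psi_inv[p](v) -> max frequency of precision p at supply v."""
    for v in sorted(v_grid):                         # try the lowest voltage first
        f32, f16, f8 = (psi_inv[p](v) for p in (32, 16, 8))
        if throughput_ok(f32, f16, f8, n32, n16, n8, F):
            return v, (f32, f16, f8)
    raise ValueError("throughput not reachable on the given voltage grid")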
E. Algorithm C 
Although algorithm B removes power supply voltage transitions by setting a single voltage level V, there may be better power-saving combinations of power supply voltages and operating frequencies: (Vmin32, f32), (Vmin16, f16), and (Vmin8, f8). The aim of algorithm C is to find such an optimum
for reduced power consumption. To limit complexity, we will 
only seek to minimize the dynamic power dissipated as a result 
of the computation 
P = C V^2 f    (16)
  = Cm32 Vmin32^2 f32 + Cm16 Vmin16^2 f16 + Cm8 Vmin8^2 f8    (17)
  = χ(f32, f16).    (18)
Given that the Vmin– f relationships are known (12)–(14), 
one could find the minimum of the above equation for the 
specified throughput (11). For example, when F is set to 
50 MHz, the values for (Vmin32, f32), (Vmin16, f16), (Vmin8, f8) 
are found to be (1.15 V, 15 MHz), (1.30 V, 20 MHz), and 
(1.75 V, 45 MHz), respectively. The overall power performance of algorithm C is shown in Table IV. When considering DVS, razor, RAM, and dedicated scheduling circuitry, algorithm C exhibits the least power consumption, with an overall power reduction of 86.3% compared with the standard 32 × 32 bit fixed-width multiplier. However, it requires two additional dithering units to generate all three discrete power supply levels Vmin32, Vmin16, and Vmin8 and thus remove transitions among these different supply levels. This increases the total silicon area overhead to 27.1%. Therefore, algorithm B provides the most attractive tradeoff, with an 81.5% power reduction and a silicon area overhead of just 11.9%.
VIII. CONCLUSION 
We proposed a novel MP multiplier architecture featuring, respectively, a 28.2% and 15.8% reduction in silicon area and power consumption compared with its 32 × 32 bit conventional fixed-width multiplier counterpart. When integrating this MP multiplier architecture with an error-tolerant razor-based DVS approach and the proposed novel operand scheduler, a 77.7%–86.3% total power reduction was achieved with a total silicon area overhead as low as 11.1%. The fabricated chip demonstrated run-time adaptation to the actual workload by operating at the minimum supply voltage level and minimum clock frequency while meeting throughput requirements. The proposed dedicated operand scheduler rearranges operations on input operands so as to reduce the number of supply voltage transitions and, in turn, minimize the overall power consumption of the multiplier. The proposed MP razor-based DVS multiplier provides a solution toward achieving full computational flexibility and low power consumption for various general purpose low-power applications.
ACKNOWLEDGMENT 
The authors would like to thank Dr. M. K. Law for his comments and discussions. They would also like to acknowledge Mr. S. F. Luk for his help with the chip test measurements.
REFERENCES 
[1] R. Min, M. Bhardwaj, S.-H. Cho, N. Ickes, E. Shih, A. Sinha, A. Wang, 
and A. Chandrakasan, “Energy-centric enabling technologies for wireless sensor networks,” IEEE Wirel. Commun., vol. 9, no. 4, pp. 28–39,
Aug. 2002. 
[2] M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001. 
[3] A. Wang and A. Chandrakasan, “Energy-aware architectures for a real-valued 
FFT implementation,” in Proc. IEEE Int. Symp. Low Power 
Electron. Design, Aug. 2003, pp. 360–365. 
[4] T. Kuroda, “Low power CMOS digital design for multimedia processors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367.
[5] H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. 
IEEE Int. SOC Conf., Sep. 2004, pp. 123–126. 
[6] S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable 
booth multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, 
no. 3, pp. 568–580, Mar. 2010. 
[7] O. A. Pfander, R. Hacker, and H.-J. Pfleiderer, “A multiplexer-based 
concept for reconfigurable multiplier arrays,” in Proc. Int. Conf. Field 
Program. Logic Appl., vol. 3203. Sep. 2004, pp. 938–942. 
[8] F. Carbognani, F. Buergin, N. Felber, H. Kaeslin, and W. Fichtner, 
“Transmission gates combined with level-restoring CMOS gates reduce 
glitches in low-power low-frequency multipliers,” IEEE Trans. Very 
Large Scale Integr. (VLSI) Syst., vol. 16, no. 7, pp. 830–836, Jul. 2008. 
[9] T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by 
data-driven voltage variation,” in Proc. IEEE Int. Symp. Circuits Syst., 
May 2004, pp. 285–288. 
[10] W. Ling and Y. Savaria, “Variable-precision multiplier for equalizer with 
adaptive modulation,” in Proc. 47th Midwest Symp. Circuits Syst., vol. 1. 
Jul. 2004, pp. I-553–I-556. 
[11] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “A micropower low-voltage 
multiplier with reduced spurious switching,” IEEE Trans. Very Large 
Scale Integr. (VLSI) Syst., vol. 13, no. 2, pp. 255–265, Feb. 2005. 
[12] M. Sjalander, M. Drazdziulis, P. Larsson-Edefors, and H. Eriksson, “A low-leakage twin-precision multiplier using reconfigurable
power gating,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005, 
pp. 1654–1657. 
[13] S.-R. Kuang and J.-P. Wang, “Design of power-efficient pipelined 
truncated multipliers with various output precision,” IET Comput. Digital 
Tech., vol. 1, no. 2, pp. 129–136, Mar. 2007. 
[14] J. L. Holt and J.-N. Hwang, “Finite precision error analysis of neural 
network hardware implementations,” IEEE Trans. Comput., vol. 42, 
no. 3, pp. 281–290, Mar. 1993. 
[15] A. Bermak, D. Martinez, and J.-L. Noullet, “High-density 16/8/4-bit 
configurable multiplier,” Proc. Inst. Electr. Eng. Circuits Devices Syst., 
vol. 144, no. 5, pp. 272–276, Oct. 1997. 
[16] T. Kuroda, “Low power CMOS digital design for multimedia processors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367.
[17] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, 
“A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State 
Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000. 
[18] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, 
A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and 
T. Furuyama, “Variable supply-voltage scheme for low-power high-speed 
CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, no. 3, 
pp. 454–462, Mar. 1998.
[19] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, 
A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura, “Dynamic 
voltage and frequency management for a low-power embedded microprocessor,”
IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 28–35, 
Jan. 2005. 
[20] J.-Y. Kang and J.-L. Gaudiot, “A simple high-speed multiplier design,” IEEE Trans. Comput., vol. 55, no. 10, pp. 1253–1258,
Oct. 2006. 
[21] G. Y. Jeong, J. S. Park, and H. C. Kang, “A Study on multiplier 
architecture optimized for 32-bit processor with 3-stage pipeline,” in 
Proc. Int. SoC Design Conf., Oct. 2004, pp. 656–660. 
[22] S. Perri, P. Corsonello, M. A. Iachino, M. Lanuzza, and G. Cocorullo, 
“Variable precision arithmetic circuits for FPGA-based multimedia 
processors,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, 
no. 9, pp. 995–999, Sep. 2004. 
[23] S. D. Haynes, A. Ferrari, and P. Y. K. Cheung, “Flexible reconfigurable 
multiplier blocks suitable for enhancing the architecture of FPGAs,” in 
Proc. IEEE Custom Integr. Circuits, May 1999, pp. 191–194. 
[24] S. Das, D. Blaauw, D. Bull, K. Flautner, and R. Aitken, “Addressing 
design margins through error-tolerant circuits,” in Proc. Design Autom. 
Conf., Jul. 2009, pp. 11–12. 
[25] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, 
D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A low-power 
pipeline based on circuit-level timing speculation,” in Proc. Int. Symp. 
Microarchit., Dec. 2003, pp. 7–18. 
[26] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, T. Mudge, and 
K. Flautner, “A self-tuning DVS processor using delay-error detection 
and correction,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 792–804, 
Apr. 2006. 
[27] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, 
D. M. Bull, and D. T. Blaauw, “RazorII: In situ error detection and 
correction for PVT and SER tolerance,” IEEE J. Solid-State Circuits, 
vol. 44, no. 1, pp. 32–48, Jan. 2009. 
[28] B. Calhoun and A. Chandrakasan, “Ultra-dynamic voltage scaling using 
sub-threshold operation and local voltage dithering in 90 nm CMOS,” 
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2005, 
pp. 300–301. 
[29] E. D. Kyriakis-Bitzaros and S. Nikolaidis, “Estimation of bit-level transition activity in datapaths based on word-level statistics and conditional
entropy,” IEE Proc. Circuits, Devices Syst., vol. 149, no. 4, pp. 234–240, 
Aug. 2002. 
[30] A. Youssef, M. Anis, and M. Elmasry, “A comparative study between 
static and dynamic sleep signal generation techniques for leakage 
tolerant designs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 
vol. 16, no. 9, pp. 1114–1126, Sep. 2008. 
Xiaoxiao Zhang (S’06) received the B.S. degree 
from the Department of Microelectronics, Tianjin 
University, Tianjin, China, and the M.E. degree from 
the Institute of Microelectronics, Chinese Academy 
of Sciences, Beijing, China, in 2003 and 2006, 
respectively. She is currently pursuing the Ph.D. 
degree with the Electronic and Computer Engineering Department, Hong Kong University of Science
and Technology, Hong Kong. Her Ph.D. research 
work involves the design of low-power real-time 
digital image processing (DIP) cores or modules for 
a camera-on-a-chip. 
Her current research interests include low-power and high-performance 
VLSI circuits design, signal processing architectures, face detection, and 3-D 
object/face recognition. 
Farid Boussaid (M’00–SM’04) received the M.S. 
and Ph.D. degrees in microelectronics from the 
National Institute of Applied Science (INSA), 
Toulouse, France, in 1996 and 1999, respectively. 
He joined Edith Cowan University, Perth, Australia, as a Postdoctoral Research Fellow and
a member of the Visual Information Processing 
Research Group in 2000. He joined the University 
of Western Australia, Crawley, Australia, in 2005, 
where he is currently an Associate Professor. 
His current research interests include smart CMOS 
vision sensors, gas sensors, neuromorphic systems, device simulation, modeling, and characterization in deep submicron CMOS processes.
Amine Bermak (M’99–SM’04–F’13) received the 
M.Eng. and Ph.D. degrees in electronic engineering 
from Paul Sabatier University, Toulouse, France, in 
1994 and 1998, respectively. 
He joined the Advanced Computer Architecture 
Research Group, York University, York, U.K., where 
he worked as a Post-Doctoral Fellow on the 
VLSI implementation of CMM neural networks for 
vision applications in a project funded by British 
Aerospace. He joined Edith Cowan University, 
Perth, Australia, in 1998, first as a Research Fellow 
working on smart vision sensors, then as a Lecturer and a Senior Lecturer. 
He is currently a Professor with the Electronic and Computer Engineering 
Department, Hong Kong University of Science and Technology (HKUST), 
Hong Kong. His current research interests include VLSI circuits and systems 
for signal, image processing, sensors, and microsystems applications. 
Dr. Bermak was a recipient of many distinguished awards, including the 
2004 “IEEE Chester Sall Award,” the HKUST “Engineering School Teaching 
Excellence Award” in 2004 and 2009, and the “Best Paper Award” at the 2005 
International Workshop on System-On-Chip for Real-Time Applications.

  • 1. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 759 32 Bit×32 Bit Multiprecision Razor-Based Dynamic Voltage Scaling Multiplier With Operands Scheduler Xiaoxiao Zhang, Student Member, IEEE, Farid Boussaid, Senior Member, IEEE, and Amine Bermak, Fellow, IEEE Abstract—In this paper, we present a multiprecision (MP) reconfigurable multiplier that incorporates variable precision, parallel processing (PP), razor-based dynamic voltage scaling (DVS), and dedicated MP operands scheduling to provide opti-mum performance for a variety of operating conditions. All of the building blocks of the proposed reconfigurable multiplier can either work as independent smaller-precision multipliers or work in parallel to perform higher-precision multiplications. Given the user’s requirements (e.g., throughput), a dynamic volt-age/ frequency scaling management unit configures the multiplier to operate at the proper precision and frequency. Adapting to the run-time workload of the targeted application, razor flip-flops together with a dithering voltage unit then configure the multiplier to achieve the lowest power consumption. The single-switch dithering voltage unit and razor flip-flops help to reduce the voltage safety margins and overhead typically associated to DVS to the lowest level. The large silicon area and power overhead typically associated to reconfigurability features are removed. Finally, the proposed novel MP multiplier can further benefit from an operands scheduler that rearranges the input data, hence to determine the optimum voltage and frequency operating conditions for minimum power consumption. This low-power MP multiplier is fabricated in AMIS 0.35-μm technology. Experimental results show that the proposed MP design features a 28.2% and 15.8% reduction in circuit area and power consumption compared with conventional fixed-width multiplier. When combining this MP design with error-tolerant razor-based DVS, PP, and the proposed novel operands scheduler, 77.7%–86.3% total power reduction is achieved with a total silicon area overhead as low as 11.1%. This paper successfully demonstrates that a MP architecture can allow more aggressive frequency/supply voltage scaling for improved power efficiency. Index Terms—Computer arithmetic, dynamic voltage scaling, low power design, multi-precision multiplier. I. INTRODUCTION CONSUMERS demand for increasingly portable yet high-performance multimedia and communication products imposes stringent constraints on the power consumption of individual internal components [1]–[4]. Of these, multipliers perform one of the most frequently encountered arithmetic Manuscript received June 8, 2012; revised February 11, 2013; accepted February 20, 2013. Date of publication April 18, 2013; date of current version March 18, 2014. This work was supported in part by a grant from the HK Research Grant Council, under Grant 610509 and the Australian Research Council’s Discovery Projects Funding Scheme under Grant DP130104374. X. Zhang and A. Bermak are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: zhangxx@ust.hk; eebermak@ust.hk). F. Boussaid is with the School of Electrical, Electronic, and Computer Engineering, The University of Western Australia, Perth 6017, Australia (e-mail: farid.boussaid@uwa.edu.au). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. 
Digital Object Identifier 10.1109/TVLSI.2013.2252032 operations in digital signal processors (DSPs) [4]. For embed-ded applications, it has become essential to design more power-aware multipliers [4]–[13]. Given their fairly complex structure and interconnections, multipliers can exhibit a large number of unbalanced paths, resulting in substantial glitch generation and propagation [8], [11]. This spurious switching activity can be mitigated by balancing internal paths through a combination of architectural and transistor-level optimization techniques [8], [11]. In addition to equalizing internal path delays, dynamic power reduction can also be achieved by mon-itoring the effective dynamic range of the input operands so as to disable unused sections of the multiplier [6], [12] and/or truncate the output product at the cost of reduced precision [13]. This is possible because, in most sensor applications, the actual inputs do not always occupy the entire magnitude of its word-length. For example, in artificial neural network applications, the weight precision used during the learning phase is approximately twice that of the retrieval phase [14]. Besides, operations in lower precisions are the most frequently required. In contrast, most of today’s full-custom DSPs and application-specific integrated circuits (ASICs) are designed for a fixed maximum word-length so as to accommodate the worst case scenario. Therefore, an 8-bit multiplication com-puted on a 32-bit Booth multiplier would result in unnecessary switching activity and power loss. Several works investigated this word-length optimization. [1], [2] proposed an ensemble of multipliers of different pre-cisions, with each optimized to cater for a particular scenario. Each pair of incoming operands is routed to the smallest multiplier that can compute the result to take advantage of the lower energy consumption of the smaller circuit. This ensemble of point systems is reported to consume the least power but this came at the cost of increased chip area given the used ensemble structure. To address this issue, [3], [5] proposed to share and reuse some functional modules within the ensemble. In [3], an 8-bit multiplier is reused for the 16-bit multiplication, adding scalability without large area penalty. Reference [5] extended this method by implementing pipelining to further improve the multiplier’s performance. A more flexible approach is proposed in [15], with several mul-tiplier elements grouped together to provide higher precisions and reconfigurability. Reference [7] analyzed the overhead associated to such reconfigurable multipliers. This analysis showed that around 10%–20% of extra chip area is needed for 8–16 bits multipliers. Combining multiprecision (MP) with dynamic voltage scal-ing (DVS) can provide a dramatic reduction in power con- 1063-8210 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
  • 2. 760 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 sumption by adjusting the supply voltage according to circuit’s run-time workload rather than fixing it to cater for the worst case scenario [4]. When adjusting the voltage, the actual performance of the multiplier running under scaled voltage has to be characterized to guarantee a fail-safe operation. Conventional DVS techniques consist mainly of lookup table (LUT) and on-chip critical path replica approaches [17]–[19]. The LUT approach tunes the supply voltage according to a predefined voltage-frequency relationship stored in a LUT, which is formed considering worst case conditions (process variations, power supply droops, temperature hot-spots, cou-pling noise, and many more). Therefore, large margins are necessarily added, which in turn significantly decrease the effectiveness of the DVS technique. The critical path replica approach typically involves an on-chip critical path replica to approximate the actual critical path. Therefore, voltage could be scaled to the extent that the replica fails to meet the timing. However, safety margins are still needed to compensate for the intradie delay mismatch and address fast-changing transient effects [24]. In addition, the critical path may change as a result of the varying supply voltage or process or tempera-ture variations. If this occurs, computations will completely fail regardless of the safety margins. The aforementioned limitations of conventional DVS techniques motivated recent research efforts into error-tolerant DVS approaches [24]–[27], which can run-time operate the circuit even at a voltage level at which timing errors occur. A recovery mechanism is then applied to detect error occurrences and restore the correct data. Because it completely removes worst case safety margins, error-tolerant DVS techniques can further aggressively reduce power consumption. In this paper, we propose a low power reconfigurable multiplier architecture that combines MP with an error-tolerant DVS approach based on razor flip-flops [25]. The main contributions of this paper can be summarized follows. 1) A novel MP multiplier architecture featuring, respectively, 28.2% and 15.8% reduction in silicon area and power consumption compared with its conventional 32 × 32 bit fixed-width multiplier counterpart. All reported multipliers trade silicon area/power consumption for MP [7]. In this paper, silicon area is optimized by applying an operation reduction technique that replaces a multiplier by adders/subtractors. 2) A silicon implementation of this MP multiplier integrating an error-tolerant razor-based dynamic DVS approach. The fabricated chip demonstrates run-time adaptation to the actual workload by operating at the minimum supply voltage level and minimum clock frequency while meeting throughput requirements. Prior works combining MP with DVS have only considered a limited number of offline simulated precision-voltage pairs, with unnecessary large safety margins added to cater for critical paths [9], [10]. 3) A novel dedicated operand scheduler that rearranges operations on input operands so as to reduce the number of transitions of the supply voltage and, in turn, minimize the overall power consumption of the multiplier. 
Unlike reported scheduling works, the Performance request Input data flow Voltage and Frequency Management Unit (VFMU) Input Operands Scheduler (IOS) Target voltage Target clock frequency reference System-on-chip Voltage Scaling Unit (VSU) Clock Frequency Scaling Unit (FSU/VCO) FPGA Multi-precision Multiplier Scheduled data flow Supply voltage Operating Clock Multiplication results Error feedback reference Fig. 1. Overall multiplier system architecture. function of the proposed scheduler is not task scheduling rather input operands scheduling for the proposed MP multiplier. The rest of this paper is organized as follows. Section II presents the operation and architecture of the proposed MP multiplier. Section III presents the approach used to reduce the overhead associated to MP and reconfigurability. Section IV presents the operating principle and implementation of the DVS management unit. Section V presents the razor flip-flops, which are at the heart of the DVS flow. Section VI presents experimental results. Section VII presents the operands sched-uler unit. Finally, a conclusion is given in Section VIII. II. SYSTEM OVERVIEW AND OPERATION The proposed MP multiplier system (Fig. 1) comprises five different modules that are as follows: 1) the MP multiplier; 2) the input operands scheduler (IOS) whose function is to reorder the input data stream into a buffer, hence to reduce the required power supply voltage transitions; 3) the frequency scaling unit implemented using a voltage controlled oscillator (VCO). Its function is to generate the required operating frequency of the multiplier; 4) the voltage scaling unit (VSU) implemented using a volt-age dithering technique to limit silicon area overhead. Its function is to dynamically generate the supply voltage so as to minimize power consumption; 5) the dynamic voltage/frequency management unit (VFMU) that receives the user requirements (e.g., throughput). The VFMU sends control signals to the VSU and FSU to generate the required power supply voltage and clock frequency for the MP multiplier. The MP multiplier is responsible for all computations. It is equipped with razor flip-flops that can report timing
  • 3. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 761 Fig. 2. Possible configuration modes of proposed MP multiplier. errors associated to insufficiently high voltage supply levels. The operation principle is as follows. Initially, the multiplier operates at a standard supply voltage of 3.3 V. If the razor flip-flops of the multiplier do not report any errors, this means that the supply voltage can be reduced. This is achieved through the VFMU, which sends control signals to the VSU, hence to lower the supply voltage level. When the feedback provided by the razor flip-flops indicates timing errors, the scaling of the power supply is stopped. The proposed multiplier (Fig. 2) not only combines MP and DVS but also parallel processing (PP). Our multiplier comprises 8 × 8 bit reconfigurable multipliers. These building blocks can either work as nine independent multipliers or work in parallel to perform one, two or three 16 × 16 bit multiplications or a single-32 × 32 bit operation. PP can be used to increase the throughput or reduce the supply voltage level for low power operation. Fig. 3 shows the benefits of the different approaches being considered. Power consumption is a linear function of the workload, which is normally represented by the input operands precision. Curve 1 corresponds to the case of a fixed-precision (FP) multiplier using a fixed power supply. Region 1 shows the power optimization space for MP techniques, which use different-precision multiplications to reduce power. If one combines MP with DVS, power is further reduced with curves (1)–(3) becoming curves (4)–(6), respectively. Regions 1 and 2 show the power optimization space for the combined approach. Based on PP, the operating frequency could be decreased together with the supply voltage, as shown in curves (7) and (8). Finally, region 3 shows the optimization space for the proposed approach, which combines MP, DVS with PP. III. MP AND RECONFIGURABILITY OVERHEAD Fig. 4 shows the structure of the input interface unit, which is a submodule of the MP multiplier (Fig. 1). The role of this input interface unit (Fig. 4) is to distribute the input data between the nine independent processing elements (PEs) (Fig. 2) of the 32 × 32 bit MP multiplier, considering the selected operation mode. The input interface unit uses an extra MSB sign bit to enable both signed and unsigned Fig. 3. Conceptual view of optimization spaces of MP, DVS, and PP approaches. multiplications. A 3-bit control bus indicates whether the inputs are 1/4/9 pair(s) of 8-bit operands, or 1/2/3 pair(s) of 16-bit operands, or 1 pair of 32-bit operands, respectively. Depending on the selected operating mode, the input data stream is distributed (Fig. 4) between the PEs to perform the computation. Fig. 5 shows how three 8 × 8 bit PEs are used to realize a 16 × 16 bit multiplier. The 32 × 32 bit multiplier is constructed using a similar approach but requires 3 × 3 PEs. A 3-bit control word defines which PEs work concurrently and which PEs are disabled. Whenever the full precision (32 × 32 bit) is not exercised, the supply voltage and the clock frequency may be scaled down according to the actual workload. To evaluate the overhead associated to reconfigurability and MP, we define X and Y as the 2n-bits wide multiplicand and multiplier, respectively. XH, YH are their respective n most significant bits whereas XL, YL are their respective n least significant bits. XLYL , XHYL , XLYH, XHYH is the crosswise products. 
The product of X and Y can be expressed as follows: P = (XHYH )22n + (XHYL + XLYH)2n + XLYL (1) where 2n-bit reconfigurable multiplier can be built using adders and four n bit × n bit multipliers to compute XHYH, XHYL , XLYH, and XLYL . Table I shows that this would result in overheads of 18% and 13% for the silicon area and power, respectively. However, if we define [18] X = XH + XL (2) Y = YH + YL (3) then (1) could be rewritten as follows: P =(XHYH)22n+(XY −XHYH−XLYL )2n+XLYL . (4) Comparing (1) and (4), we have removed one n × n bit multiplier (for calculating XHYL or XLYH ) and one 2n-bit adder (for calculating XHYL + XLYH). The two adders are replaced with two n-bit adders (for calculating XH + XL and
  • 4. 762 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 X16_3//1[15:8] X16_2//1[15:8] X16_1//[15:8] X8_9//2 X8_4//2 0 X32_1//[15:8] MUX X16_3//1[7:0] X16_2//1[7:0] Y16_1//[7:0] X8_9//1 X8_4//1 X8_1// X32_1//[7:0] MUX X32_1//[31:24] X16_3//2[15:8] X16_2//2[15:8] X8_9//4 X8_4//4 0 MUX X32_1//[23:16] X16_3//2[7:0] X16_2//2[7:0] X8_9//3 X8_4//3 0 MUX X16_3//3[15:8] X8_9//6 0 MUX X8_9//5 0 MUX X8_9//8 0 MUX X8_9//7 0 MUX X16_3//3[7:0] X8_9//9 0 X MUX Y16_3//1[15:8] Y16_2//1[15:8] Y16_1//[15:8] Y8_9//2 Y8_4//2 0 Y32_1//[15:8] MUX Y16_3//1[7:0] Y16_2//1[7:0] Y16_1//[7:0] Y8_9//1 Y8_4//1 Y8_1// Y32_1//[7:0] MUX Y32_1//[31:24] Y16_3//2[15:8] Y16_2//2[15:8] Y8_9//4 Y8_4//4 0 MUX Y32_1//[23:16] Y16_3//2[7:0] Y16_2//2[7:0] Y8_9//3 Y8_4//3 0 MUX Y16_3//3[15:8] Y8_9//6 0 MUX Y8_9//5 0 MUX Y8_9//8 0 MUX Y8_9//7 0 MUX Y16_3//3[7:0] Y8_9//9 0 Y MUX 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 PE9 PE8 PE7 PE6 PE5 PE4 PE3 PE2 PE1 3-bit mode control Fig. 4. Structure of input interface unit. Fig. 5. Three PEs combined to form 16 × 16 bit multiplier. YH + YL) and two (2n + 2)-bit subtractors (for calculating XY − XHYH − XLYL ). In a 32-bit multiplier, we can thus significantly reduce the design complexity by using two 34-bit subtractors to replace a 16 × 16 bit multiplier. We actually need two 16 × 16 bit multipliers (for calculating XHYH and XLYL ) and one 17 × 17 bit multiplier (for calculating XY ). To evaluate the proposed MP architecture, a conventional 32-bit fixed-width multiplier and four sub-block MP mul-tipliers are designed using a Booth Radix-4 Wallace tree structure similar to that used for the building blocks of our MP three sub-block multiplier. These multipliers are synthesized using the synopsys design compiler with AMIS 0.35-μm complimentary metal-oxide-semiconductor (CMOS) standard cell technology library. The power simulations are performed at a clock frequency of 50 MHz and at a power supply of 3.3 V. Table I shows the implementation results including silicon area and power consumption for these multipliers. The proposed MP three sub-block architecture can achieve reductions of about 16% in power and 28% in area as compared with the conventional 32 × 32 bit fixed-width multiplier design. The TABLE I AREA AND POWER COMPARISON OF PROPOSED MP MULTIPLIERS AGAINST CONVENTIONAL FIXED-WIDTHMULTIPLIER RUNNING AT 50 MHz Schemes Power (mW) Area (mm2) 32-bit 39.62 0.624 fixed-width multiplier (100%) (100%) 32-bit 4 sub-block 44.76 0.736 MP multiplier (113%) (118%) 32-bit 3 sub-block 33.36 0.448 MP multiplier (84%) (72%) latter uses a Booth radix-4Wallace tree structure similar to that used in designing the building blocks of our MP multipliers. However, because of its larger size, the 32 × 32 bit fixed-width multiplier exhibits an irregular layout with complex interconnects. This limitation of tree multipliers happens to be addressed by our MP 32 × 32 bit multiplier, which uses a more regular design to partition, regroup, and sum partial products. IV. DYNAMIC VOLTAGE AND FREQUENCY SCALING MANAGEMENT A. DVS Unit In our implementation (Fig. 1), a dynamic power supply and a VCO are employed to achieve real-time dynamic voltage and frequency scaling under various operating conditions. In [28], near-optimal dynamic voltage scaling can be achieved when using voltage dithering, which exhibits faster response time than conventional voltage regulators. Voltage dithering uses power switches to connect different supply voltages to the load, depending on the time slots. 
Therefore, an intermediate average voltage is achieved. This conventional voltage dither-ing technique has some limitations. If the power switches are toggled with overlapping periods, switches can be turned on simultaneously, giving rise to a large transient current. To mitigate this, nonoverlapping clocks could be used to control power switches. However, this may result in system instability as there are instances where all supply voltages are disconnected from the load. The requirement for multiple supplies can also result in system overhead. To address these issues, we implemented a single-supply voltage dithering
  • 5. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 763 (a) (b) Fig. 6. (a) Proposed single-header voltage dithering unit and voltage and frequency tuning loops. (b) Experimental timing results from voltage dithering unit. scheme [Fig. 6(a)], which operates as follows. When the sup-ply voltage (Vn) of the multiplier drops below the predefined reference voltage (Vref), the comparator output (Va) toggles. Therefore, the VFMU turns on the power switch via Vctrl, for a predefined duration Tc = 5 μs. The chosen value for the off-chip storage capacitor Cs is 4.7 μF. This value is chosen to achieve a voltage ripple magnitude of 50 mV [Fig. 6(b)] with a charging current set to 50 mA, hence to limit the resistive power loss of the dithering unit to less than 1% of the total power consumption. The value of Cs is a tradeoff between ripple magnitude, tracking speed, and area/power overheads. Fig. 6(b) shows experimental results for the voltage control loop. B. Dynamic Frequency Scaling Unit In the proposed 32 × 32 bit MP multiplier, dynamic frequency tuning is used to meet throughput requirements. It is based on a VCO implemented as a seven-stage current starved ring oscillator. The VCO output frequency can be tuned from 5 to 50 MHz using four control bits (5 MHz/step). This frequency range is selected to meet the requirements of general purpose DSP applications. The reported multiplier can operate as a 32-bit multiplier or as nine independent 8-bit multipliers. For the chosen 5–50 MHz operating range, our multiplier boasts up to 9 × 50 = 450 MIPS. The simulated power consumption for the VCO ranges from Fig. 7. Experimental measurement of worst case frequency switching (from 50 to 5 MHz). Fig. 8. Conceptual view of razor flip-flop [25]. 85 (5 MHz) to 149 μW (50 MHz), which is negligible com-pared with the power consumed by the multiplier. Fig. 7 shows experimental measurements showing the transient response for the worst case frequency switching (from 50 to 5 MHz). Clock frequency can settle within one clock cycle as required. V. IMPLEMENTATION OF RAZOR FLIP-FLOPS Although the worst case paths are very rarely exercised, tra-ditional DVS approaches still maintain relatively large safety margins to ensure reliable circuit operation, resulting in exces-sive power dissipated. The razor technology is a breakthrough work, which largely eliminates the safety margins by achieving variable tolerance through in-situ timing error detection and correction ability [25]. This approach is based on a razor flip-flop, which detects and corrects delay errors by double sampling. The razor flip-flop (Fig. 8) operates as a standard positive edge triggered flip-flops coupled with a shadow latch, which samples at the negative edge. Therefore, the input data is given in the duration of the positive clock phase to settle down to its correct state before being sampled by the shadow latch. The minimum allowable supply voltage needs to be set, hence the shadow latch (Fig. 8) always clocks the correct data even for the worst case conditions. This requirement is usually satisfied given that the shadow latch is clocked later than the main flip-flop. A comparator flags a timing error when it detects a discrepancy between the speculative data sampled at the main flip-flop and the correct data sampled
  • 6. 764 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 V C O Voltage Scaling Unit Multiplier Razor flip-flops Fig. 9. Microphotograph of 32 × 32 bit MP multiplier. at the shadow latch. The correct data would subsequently overwrite the incorrect signal. The key idea behind razor flip-flops is that if an error is detected at a given pipeline stage X, then computations are only re-executed through the following pipeline stage X + 1. This is possible because the correct sampled value would be held by the shadow latch [25]. This approach ensures forward progress of data through the entire pipeline at the cost of a single-clock cycle [25]. An error correction mechanism, based on global clock gating, is implemented in the proposed multiplier [25]. In this correction scheme, error and clock signals are used to deter-mine when the entire pipeline needs to be stalled for a single-clock cycle. Fig. 1 shows that a global error signal is fed to the VFMU so as to alert the controlling unit whenever the current operating voltage is lower than necessary. The VFMU will then increase the voltage reference. This will in turn result in the VSU generating a new supply voltage level based on the new target voltage reference. When an error occurs, results can be recomputed at any pipeline stage using the corresponding input of the shadow latch. Therefore, the correct values can be forwarded to the corresponding next stages. Given that all stages can carry out these recomputations in parallel, the adopted global clock gating can tolerate any number of errors within a given clock cycle [25]. After one clock cycle, normal pipeline operation can resume. The actual implementation of razor flip-flops requires careful design to meet timing constraints and avoid system failure. For example, the use of a delayed clock for the shadow latch (Fig. 8) makes it possible for a short-path in the combinational logic to corrupt the data in the shadow latch [25]. This imposes a short-path delay constraint at the input of each razor flip-flop of our multiplier. To meet these constraints across all corners, we inserted delay buffers through all short paths found by Cadence silicon-on-chip (SOC) Encounter and validated them through Prime Time. In addition, precautions are used to mitigate metastability by inserting a metastability detector at the output of each main flip-flop. The outputs of the metastability detector and the error comparator (Fig. 8) are ORED to generate the error signal of individual razor flip-flops [25], [26]. These razor error signals are OR-ED together to form a global error signal used to ensure that all valid data in the shadow TABLE II PROTOTYPE CHARACTERISTICS Technology node 0.35 μm Die size 1.5 × 1.0 mm Total number of transistors 37656 Measured chip power at 3.3 V 39 mW DVS supply voltage range 0–3.3 V DFS clock frequency range 5–50 MHz Total number of flip-flops 144 Number of razor flip-flops 13 Standard D flip-flop power 57 μW Razor flip-flop power (static/switching) 70/239 μW Total power overhead of razor flip-flops 2.3% latches is restored into the main flip-flops before the next clock cycle. The adopted design for the metastability detector is that proposed in [26]. This metastability detector relies on skewed inverters, which require careful simulation through all process corners to ensure proper operation [26]. 
When implementing razor-based DVS, it is essential that the resulting power/delay overhead be kept to a minimum, hence not to severely limit the benefits brought by aggressive supply voltage scaling. In the case of our multiplier, only 13 out of a total of 144 flip-flops that is 9% of the flip-flops are found not to meet timing constraints under worst case level of the supply voltage (Table II). Therefore, only these 13 critical paths are equipped with razor flip-flops. These 13 near-critical paths are identified through Cadence SOC Encounter and validated using Prime Time. At a supply voltage of 3.3 V and operating frequency of 50 MHz, the razor flip-flop is found to consume 1.2 times more static/switching power (70/57 μW) when no timing errors are detected. In the other case, it consumes 4.2 times more static/switching power (239/57 μW). However, for a conservative activity factor of 1%, the power overhead due to razor flip-flops was estimated to be less than 2.3% of the nominal chip power because only 9% of the flip-flops were made razor flip-flops. Therefore, both the silicon area and power overheads associated to razor flip-flops are found to be negligible. In regard to the razor flip-flop’s delay overhead, it is mainly because of the additional multiplexer at its input as well as the increased fan-out resulting from the introduction of comparator, metastability detector, and OR gates at the output. At a supply voltage of 3.3 V and operating frequency of 50 MHz, delay overheads are found to be 1.20% and 3.58% for error-free and error-occurring cases, respectively. These delay overheads constitute a small penalty for the massive power reduction enabled by razor-based DVS. VI. PERFORMANCE EVALUATION AND DISCUSSION We designed and fabricated a 32 × 32 bit reconfigurable multiplier in AMIS 0.35-μm technology. The die photograph of the multiplier system is shown in Fig. 9 and the chip characteristics are shown in Table II. The operating mode
  • 7. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 765 32 bit mode 16 bit mode 8 bit mode 5 10 15 20 25 30 35 40 45 50 2.5 2.0 1.5 1.0 Minimum Voltage (V) Frequency (MHz) Fig. 10. Experimental results of minimum voltage supply for different precisions and operating frequencies. of the multiplier is controlled by three external signals. The operating voltage and frequency are tuned automatically depending on the actual workload of the multiplier. The chip is tested by feeding in randomly generated operands and comparing the outputs with results from a PC processing the same data. The 32-bit precision data sets include data with an effective word-length of 17–32 bits. The 16-bit precision data sets and 8-bit precision data sets include data with an effective word-length of 9–16 bits and 0–8 bits, respectively. We achieved full functionality across a voltage range of 0.8–3.3 V, and a frequency range of 5–50 MHz. Fig. 10 shows the relation between the minimum supply voltage and operating frequency for different precision modes. As explained, razor energy savings are a result of the elimination of safety margins and processing below the first failure voltage. By scaling the voltage below the first failure point, an error rate of 0.1% is maintained and the power consumption is measured at this minimum possible voltage. For an operating frequency of 50 MHz, the supply voltage is set to 2.45, 1.95, and 1.80 V for the 32, 16, and 8-bit modes, respectively. For lower operating frequencies, the required supply voltage levels are much lower, as shown in Fig. 10. The chip power consumption for different operating modes is shown in Fig. 11. For 16-bit operands, 55.6% (17.35 versus 39.04 mW) power reduction can be obtained by the MP scheme. When the DVS technique is applied, the chip consumes 6.06 mW at the first failure point at an optimal 0.1% error rate, leading to a further 65.1% (6.06 versus 17.35 mW) power saving. Based on PP feature enabled, the operating frequency can be scaled to 1/3 of the original one, therefore the voltage would be tuned down to a much lower level for an additional 46.7% (3.23 versus 6.06 mW) power reduction. For 8-bits operands, the MP, DVS, and PP schemes can help save 87.4% (4.90 versus 39.04 mW), 70.2% (1.46 versus 4.90 mW), and 55.5% (0.65 versus 1.46 mW) power, respectively. Fig. 12 shows experimental results showing the power sav-ings associated to the MP, razor-based DVS, and PP features 40 35 30 25 20 15 10 5 0 8-b mode with DVS and PP 16-b mode without DVS 8-b mode with DVS 8-b mode without DVS 16-b mode with DVS and PP 16-b mode with DVS 32-b mode Without MP nor DVS nor PP MP with DVS with PP MP MP with DVS Power Consumption (mW) Fig. 11. Experimental results of power consumption of different operating schemes. Fig. 12. Experimental data showing power optimization spaces associated to MP, razor-based DVS, and PP schemes. of the fabricated 32 × 32 bit multiplier. Region 1 is the power optimization space for MP whereas Regions 2(a) and (b) are the power optimization spaces for the DVS technique without and with razor, respectively. Finally, region 3 is the power optimization space for PP. Fig. 12 shows that when MP is combined with DVS, power consumption is reduced to 29.12, 8.07, and 1.98 mW (points , , and in Fig. 12) for 32, 16, and 8-bit multiplications, respectively. 
In addition, razor flip-flops help reduce the operating voltage to the minimum possible level, resulting in a further power reduction of 26.1% (from 29.12 to 21.52 mW, point to , 24.9% (from 8.07 to 6.06 mW, point to , and 26.3% (from 1.98 to 1.46 mW, point to ) for 32, 16, and 8-bit precision, respectively. Based on PP, the power reduction space is further enlarged. Table III compares the performance of the fabricated prototype with related works. [7], [20], [21] correspond to FP voltage schemes whereas [5], [22], [23] are MP, fixed voltage schemes. [9], [10] are MP, multivoltage schemes. To compare the silicon area associated to each scheme, we chose to use the number of
  • 8. 766 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 transistors because it constitutes a fair metric to compare dif-ferent CMOS technology nodes. From Table III, the proposed multiplier provides the most reconfigurability while exhibit-ing the smallest relative area. Compared with the designs with the same maximum word-length of 32-bit [21], [23], our design boasts a much smaller area. For design [7] and design [5], their maximum word-length is 16-bit instead of 32-bit. If we assume that a 32 × 32 bit multiplier is built using three 16 × 16 bit multipliers, then the area of the 32-bit multiplier is at least three times that of the 16-bit multiplier, discarding the glue and reconfigurability logic. This shows that the proposed multiplier outperforms reported implementations whether considering silicon area or reconfigurability. In regard to power dissipation, Table III shows normalized power results (using P = CV2 f α0−1) to cater for the different technolo-gies previously reported. As in previous works, we consider random input test patterns, with activity factors determined using models describing the propagation of the input statistics to the output of data-path operators [29]. Normalized power results show that the proposed multiplier outperforms reported implementations in terms of power dissipation. In previous works, flexibility and reconfigurability have come at a cost of increased silicon area and power consumption. In this paper, we propose an implementation that not only provides MP reconfigurable datapath, but also obtains a reduction in both silicon area and power, as compared with FP multipliers. In more advanced deep submicrometer processes, the proposed MP multiplier with razor DVS offers the ability to compensate for process variations. It would also be essential to integrate leakage reduction techniques [30], hence to jointly minimize leakage and dynamic power consumption. VII. INPUT OPERANDS SCHEDULER A. Motivation and Operation Principle In previous section, we report experimental results obtained using different data sets, each composed of randomly generated single-precision operands. However, in some applications such as artificial neural network applications, the input data stream could include mixed-precision operands [1]. Although our multiplier provides three different precision modes (32 × 32 bit, 16 × 16 bit, 8 × 8 bit), the supply voltage would still have to transit dynamically between the minimum required voltage levels Vmin32, Vmin16, or Vmin8 required for 32, 16, and 8-bit operands, respectively. Fig. 10 shows that given a certain operating frequency, the differ-ence among Vmin32, Vmin16, and Vmin8 can be in the range of 0.1–0.65 V. If the input data stream requires frequent supply voltage transitions, significant dynamic power would be dissipated, thereby undermining the benefits of DVS. In addition, these transitions may not always be possible within one clock cycle. To minimize the overall power consumption, one needs to reduce the number of supply voltage transitions while still processing operands at the minimum required voltage level. To address this problem, we propose an IOS that will perform the following tasks: 1) reorder the input data stream such that same-precision operands are grouped together into a buffer (Fig. 
13) and 2) find the minimum supply voltages (Vmin32, Vmin16, Vmin8), and operating frequencies ( f32, f16, f8) for the three different-precision data groups to minimize the overall power consumption while still meeting the specified throughput. The block diagram of the IOS is shown in Fig. 13. It is composed of an operand range detector, a pattern generation engine, a 2 k-bit buffer-(RAM), and a frequency/voltage ana-lyzer. The scheduler operates as follows. The inputs operands are first sent to the range detector, which classifies them according to their precision: 32, 16, or 8-bit. The classified data is then grouped by the pattern generation engine, which packs same-precision data into three different 32-bit data patterns (Fig. 13): 1) pattern 1 corresponds to original 32-bit input operand Data; 2) pattern 2 combines two 16-bit operands data (with their redundant 16 MSBs removed); and 3) pattern 3 combines four 8-bit operand data (with their redundant 24 MSBs removed). At each clock cycle, a 32-bit data pattern can be processed, owning to the PP capability of the proposed multiplier. This resembles the SIMD structure, and helps to put the MP and PP capability into real effect. As in Fig. 13, the three different data patterns are counted (N32, N16, and N8) and stored into a buffer, together with the respective voltages and clock frequencies at which they should be processed. For each full buffer, there will only be two transitions needed: (Vmin8, f8)–(Vmin16, f16), and (Vmin16, f16)–(Vmin32, f32). To limit the silicon area overhead, we chose a 2k-RAM, which can store 60 32-bit data patterns. The voltage/frequency analyzer specifies the values of Vmin32, Vmin16, Vmin8, f32, f16, and f8 to the dithering unit and VCO. The Vmin– f pairs are determined during the characterization of the chip and stored in the LUT (Fig. 13). B. Problem Formulation Given a random mixed-precision (32-, 16-, or 8-bit) input data stream and specified throughput Tp, our goal is to determine the voltages (Vmin32, Vmin16, Vmin8) and frequencies ( f32, f16 and f8) at which each precision data group should be processed such that the total power consumption is minimized. In the following analysis, we consider the following four components of the total power consumption: 1) the resis-tive power loss Pdith_resistic_loss of the dithering unit; 2) the switching power loss Pdith_switching_loss of the dithering unit; 3) the dynamic power consumption Pcomputation associated to the multiplication computation; and 4) finally, Pcompu_overhead that corresponds to the power consumption of the latter computation when carried out at voltage levels higher than the nominal Vmin. The equations of the aforementioned four components of the total power consumption are given below Pdith_resistic_loss = I 2 char Ron (5) where Ichar is the charge current of the dithering unit, and Ron is the equivalent resistance of the dithering switch 2 f N Pdith_switching_loss = CgVdd (6) where Cg is the gate capacitance of the dithering switch, Vdd is the 3.3 V standard voltage, and N is the number of input
  • 9. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 767 TABLE III PERFORMANCE COMPARISON OF PROPOSED MULTIPLIERWITH RELATED WORKS Pattern 1 32-b ope Pattern 2 Pattern 3 8-b ope 8-b ope 8-b ope 8-b ope Range Detector Pattern Generation Engine Buffer (RAM) Vmin-freqency Look-up-table Voltage Frequency Analyzer Input Operands Stream Dithering Unit VCO Multi-precision Multiplier 32-b ope 32-b ope .. 32-b ope 16-b ope 16-b ope 16-b ope 16-b ope 16-b ope 16-b ope .. 16-b ope 16-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope 8-b ope .. 8-b ope 8-b ope 8-b ope 8-b ope Pattern 1 data Pattern 2 data Pattern 3 data 16-b ope 16-b ope Algorithm A Algorithm B Algorithm C Fig. 13. Block diagram of IOS. data patterns Pcomputation = CmVmin 2 f (7) where Cm is the effective capacitance of the multiplier, Vmin is the applied minimum supply voltage, and f is the applied operating frequency Pcompu_overhead = Pdt T = CmV 2 f dV T (8) where V is the dithering unit output, which fluctuates around Vmin, and T is the charge time period, which is inversely proportional to the operating frequency. The overall power consumption is thus given by Poverall = Pcomputation + Pcompu_overhead +Pdith_resistic_loss + Pdith_switching_loss. (9) In the following, we present three different algorithms to reduce this overall power consumption. Each of these algorithms constitutes a different approach to process the mixed-precision data held in the operands buffer (Fig. 13). The performance of each algorithm is evaluated using a mixed-precision data set of 120 000 randomly operands, with a third corresponding to each precision (8-, 16-, and 32-bit). In the following, the specified throughput Tp for the proposed
  • 10. 768 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 4, APRIL 2014 Fig. 14. Operation principles of operand scheduling algorithms A, B, and C. Data Block X and Data Block X+1 refer to two-consecutive operand data blocks subsequently stored into the RAM, respectively. TABLE IV DETAILED POWER PERFORMANCE OF DIFFERENT SCHEDULING ALGORITHMS Algorithm P_computation P_compu_overhead P_dith_resistic_loss P_dith_switching_loss P_overall A 3.034 mW 3.159 mW 0.059 mW 1.715 mW 8.255 mW B 2.266 mW 2.565 mW 0.084 mW 1.663 mW 6.578 mW C 1.682 mW 1.843 mW 0.062 mW 0.975 mW 4.561 mW 32 × 32 bit multiplier is 64 F (Mbits/s), where F is the multiplier’s operating frequency. C. Algorithm A In the first algorithm, the multiplier throughput Tp = 64 F is kept constant by fixing the operating frequencies ( f32−, f16−, or f8) of each precision-data group (32-, 16-, or 8-bit) to f32 = F, f16 = F 2 , f8 = F 4 (10) where F is the multiplier’s operating frequency. This is because the throughput in 8 × 8 bit multiplication mode is four times that of the 32 × 32 bit multiplication mode and double that of the 16 × 16 bit multiplication mode, as a result of the multiplier PP. The minimum supply voltage (Vmin32, Vmin16 or Vmin8) associated to each operating frequency ( f32, f16 or f8) is determined through a Vmin– f LUT. Algorithm A shows its limitations when 32-bit operands are processed initially. As shown in Fig. 14, once all N32 operands of the data block are processed, the supply voltage (Vn) needs to decrease rapidly from point A (Vmin32) to point B (Vmin16) at which all N16 16-bit operands of the data block should be processed. If N16 is too small, most 16-bit operands will be actually processed in Sections A and B, that is at a voltage possibly much higher than the minimal Vmin16 level. Similarly 8-bit operands of the data block could be processed in Sections C and D, B-C, or even A-B for the worst case. This contributes to increasing Pcompu_overhead. The overall power performance of algorithm A is shown in Table IV. Compared with the fixed-width 32 × 32 bit standard multiplier (32 × 32 bit mode must be chosen given that a third of operands are 32-bit), 77.7% total power reduction is achieved with a total silicon area overhead of only 11.1%, when considering DVS, razor, RAM, and dedicated circuitry for scheduling algorithm A. D. Algorithm B This algorithm removes all transitions of the power supply voltage by making Vmin32, Vmin16, and Vmin8 equal and adjust-ing f32, f16, and f8 such that the overall throughput is kept unchanged. We thus need to have the following: 64N32 + 128N16 + 256N8 N32 f32 + N16 f16 + N8 f8 = 64 F. (11) From a LUT, we can obtain the Vmin– f relationship as follows: Vmin32 = ψ32( f32) (12) Vmin16 = ψ16( f16) (13) Vmin8 = ψ8( f8). (14) As algorithm B keeps the supply voltage constant ψ32( f32) = ψ16( f16) = ψ8( f8) = V (15)
  • 11. ZHANG et al.: MULTIPRECISION RAZOR-BASED DYNAMIC VOLTAGE SCALING MULTIPLIER 769 the operating frequencies f32, f16, and f8 can be determined by using (11) and (15). For example, when F is set to 50 MHz, the values for V , f32, f16, and f8 are found to be 1.35 V, 20 MHz, 25 MHz, and 35 MHz, respectively. The overall power consumption of algorithm B is shown in Table IV. Due to the complete removal of voltage transitions, the Pcompu_overhead is reduced. Simultaneously, because of holistic planning, the dynamic computation power is also optimized to a lower level. Compared with the fixed-width 32 × 32 bit standard multiplier, 81.5% power reduction is achieved with a total silicon area overhead of only 11.9%, when considering DVS, razor, RAM, and dedicated circuitry for scheduling Algorithm B. E. Algorithm C Although Algorithm B removes power supply voltage tran-sitions by setting a single-voltage level V, there may be better power saving combinations of power supply voltages and operating frequencies: (Vmin32, f32), (Vmin16, f16), and (Vmin8, f8). The aim of algorithm C is to find such an optimum for reduced power consumption. To limit complexity, we will only seek to minimize the dynamic power dissipated as a result of the computation P = CV2 f (16) = Cm32V2 min32 f32 + Cm16V2 min16 f16 + Cm8V 2 min8 f8 (17) = χ( f32, f16). (18) Given that the Vmin– f relationships are known (12)–(14), one could find the minimum of the above equation for the specified throughput (11). For example, when F is set to 50 MHz, the values for (Vmin32, f32), (Vmin16, f16), (Vmin8, f8) are found to be (1.15 V, 15 MHz), (1.30 V, 20 MHz), and (1.75 V, 45 MHz), respectively. The overall power perfor-mance of algorithm C is shown in Table IV. When consid-ering DVS, razor, RAM, and dedicated scheduling circuitry, algorithm B exhibits the least power consumption, with an overall power reduction of 86.3%, compared with the standard 32 × 32 bit fixed-width multiplier. However, it requires two additional dithering units to generate all three discrete power supply levels Vmin32, Vmin16, and Vmin8 and thus remove transitions among these different supply levels. This increases the total silicon area overhead to 27.1%. Therefore, algorithm B provides the most attractive tradeoff with 81.5% reduction and a silicon area overheard of just 11.9%. VIII. CONCLUSION We proposed a novel MP multiplier architecture featuring, respectively, 28.2% and 15.8% reduction in silicon area and power consumption compared with its 32 × 32 bit conven-tional fixed-width multiplier counterpart.When integrating this MP multiplier architecture with an error-tolerant razor-based DVS approach and the proposed novel operands scheduler, 77.7%–86.3% total power reduction was achieved with a total silicon area overhead as low as 11.1%. The fabricated chip demonstrated run-time adaptation to the actual workload by operating at the minimum supply voltage level and mini-mum clock frequency while meeting throughput requirements. The proposed novel dedicated operand scheduler rearranges operations on input operands, hence to reduce the number of transitions of the supply voltage and, in turn, minimized the overall power consumption of the multiplier. The proposed MP razor-based DVS multiplier provided a solution toward achiev-ing full computational flexibility and low power consumption for various general purpose low-power applications. ACKNOWLEDGMENT The authors would like to thank Dr. M.K. 
Law for his comments and discussions.We also would like to acknowledge Mr. S.F. Luk for his help with the chip test measurements. REFERENCES [1] R. Min, M. Bhardwaj, S.-H. Cho, N. Ickes, E. Shih, A. Sinha, A. Wang, and A. Chandrakasan, “Energy-centric enabling technologies for wire-less sensor networks,” IEEE Wirel. Commun., vol. 9, no. 4, pp. 28–39, Aug. 2002. [2] M. Bhardwaj, R. Min, and A. Chandrakasan, “Quantifying and enhanc-ing power awareness of VLSI systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001. [3] A. Wang and A. Chandrakasan, “Energy-aware architectures for a real-valued FFT implementation,” in Proc. IEEE Int. Symp. Low Power Electron. Design, Aug. 2003, pp. 360–365. [4] T. Kuroda, “Low power CMOS digital design for multimedia proces-sors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367. [5] H. Lee, “A power-aware scalable pipelined booth multiplier,” in Proc. IEEE Int. SOC Conf., Sep. 2004, pp. 123–126. [6] S.-R. Kuang and J.-P. Wang, “Design of power-efficient configurable booth multiplier,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 568–580, Mar. 2010. [7] O. A. Pfander, R. Hacker, and H.-J. Pfleiderer, “A multiplexer-based concept for reconfigurable multiplier arrays,” in Proc. Int. Conf. Field Program. Logic Appl., vol. 3203. Sep. 2004, pp. 938–942. [8] F. Carbognani, F. Buergin, N. Felber, H. Kaeslin, and W. Fichtner, “Transmission gates combined with level-restoring CMOS gates reduce glitches in low-power low-frequency multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 7, pp. 830–836, Jul. 2008. [9] T. Yamanaka and V. G. Moshnyaga, “Reducing multiplier energy by data-driven voltage variation,” in Proc. IEEE Int. Symp. Circuits Syst., May 2004, pp. 285–288. [10] W. Ling and Y. Savaria, “Variable-precision multiplier for equalizer with adaptive modulation,” in Proc. 47th Midwest Symp. Circuits Syst., vol. 1. Jul. 2004, pp. I-553–I-556. [11] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “A micropower low-voltage multiplier with reduced spurious switching,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 2, pp. 255–265, Feb. 2005. [12] M. Sjalander, M. Drazdziulis, P. Larsson-Edefors, and H. Eriks-son, “A low-leakage twin-precision multiplier using reconfigurable power gating,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 1654–1657. [13] S.-R. Kuang and J.-P. Wang, “Design of power-efficient pipelined truncated multipliers with various output precision,” IET Comput. Digital Tech., vol. 1, no. 2, pp. 129–136, Mar. 2007. [14] J. L. Holt and J.-N. Hwang, “Finite precision error analysis of neural network hardware implementations,” IEEE Trans. Comput., vol. 42, no. 3, pp. 281–290, Mar. 1993. [15] A. Bermak, D. Martinez, and J.-L. Noullet, “High-density 16/8/4-bit configurable multiplier,” Proc. Inst. Electr. Eng. Circuits Devices Syst., vol. 144, no. 5, pp. 272–276, Oct. 1997. [16] T. Kuroda, “Low power CMOS digital design for multimedia proces-sors,” in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 359–367. [17] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000. [18] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, “Variable supply-voltage scheme for low-power high-speed CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, no. 
Xiaoxiao Zhang (S’06) received the B.S. degree from the Department of Microelectronics, Tianjin University, Tianjin, China, and the M.E. degree from the Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China, in 2003 and 2006, respectively. She is currently pursuing the Ph.D. degree with the Electronic and Computer Engineering Department, Hong Kong University of Science and Technology, Hong Kong. Her Ph.D. research work involves the design of low-power real-time digital image processing (DIP) cores and modules for a camera-on-a-chip. Her current research interests include low-power and high-performance VLSI circuit design, signal processing architectures, face detection, and 3-D object/face recognition.
Farid Boussaid (M’00–SM’04) received the M.S. and Ph.D. degrees in microelectronics from the National Institute of Applied Science (INSA), Toulouse, France, in 1996 and 1999, respectively. He joined Edith Cowan University, Perth, Australia, as a Postdoctoral Research Fellow and a member of the Visual Information Processing Research Group in 2000. He joined the University of Western Australia, Crawley, Australia, in 2005, where he is currently an Associate Professor. His current research interests include smart CMOS vision sensors, gas sensors, neuromorphic systems, and device simulation, modeling, and characterization in deep submicron CMOS processes.

Amine Bermak (M’99–SM’04–F’13) received the M.Eng. and Ph.D. degrees in electronic engineering from Paul Sabatier University, Toulouse, France, in 1994 and 1998, respectively. He joined the Advanced Computer Architecture Research Group, York University, York, U.K., where he worked as a Post-Doctoral Fellow on the VLSI implementation of a CMM neural network for vision applications in a project funded by British Aerospace. He joined Edith Cowan University, Perth, Australia, in 1998, first as a Research Fellow working on smart vision sensors, then as a Lecturer and a Senior Lecturer. He is currently a Professor with the Electronic and Computer Engineering Department, Hong Kong University of Science and Technology (HKUST), Hong Kong. His current research interests include VLSI circuits and systems for signal and image processing, sensors, and microsystems applications. Dr. Bermak was a recipient of many distinguished awards, including the 2004 “IEEE Chester Sall Award,” the HKUST “Engineering School Teaching Excellence Award” in 2004 and 2009, and the “Best Paper Award” at the 2005 International Workshop on System-on-Chip for Real-Time Applications.