Omar Nibouche
International Journal of Engineering (IJE), Volume (7) : Issue (1) : 2013 1
Area Efficient and Reduced Pin Count Multipliers
Omar Nibouche o.nibouche@tu.edu.sa
College of Computers and Information Technology
Taif University
POB 888 Taif 21964, KSA
Abstract
Fully serial multipliers can play an important role in the implementation of DSP algorithms in
resource-limited chips such as FPGAs; offering area efficient architectures with a reduced pin
count and moderate throughput rates. In this paper two structures that implement the fully serial
multiplication operation are presented. One significant aspect of the new designs is that they are
systolic and require near communication links only. They are superior in speed and area usage to
similar architectures in the literature. The paper also presents a new fully serial multiplier optimized
for area-time² efficiency with better performance than available architectures in the open
literature.
Keywords: Reduced Pin Count, Serial Multiplication, Area-Time².
1. INTRODUCTION
There is a crucial advantage offered by bit-serial processors over their parallel counterpart, which
lies in the very efficient use of chip area. They are particularly suitable for applications that require
slow to moderate speeds and in batch mode applications. By contrast, bit-parallel processors are
useful for high-speed systems, but at the expense of a larger area usage, and thus they are more
expensive [1-2].
Traditional bit serial multiplier structures suffer from an inefficient generation of partial products,
which leads to hardware overuse and slow speed systems. In this paper, two structures for bit serial
multiplication are presented. The first structure, called Structure I, is the first fully serial multiplier
reported in the literature whose speed is comparable to that of existing serial-parallel
multipliers. The second structure, termed Structure II, requires an extra multiplexer in the clock path,
which makes it slower, but has the merit of reducing the latency of the multiplier.
The remainder of the paper is organised as follows: in section 2, the previous work in the literature is
reviewed, while section 3 describes the new structures for fully serial multiplication. Section 4 is
concerned with an optimisation of the multiplier of Structure I in terms of area-time² efficiency. A
comparison of performance is shown in section 5 and conclusions are given in section 6.
2. BIT SERIAL MULTIPLIERS: A REVIEW
One of the early bit serial multipliers was proposed by [3]. It generates the partial products in a
recursive fashion. Consider the multiplication of two n-bit positive numbers A and B as follows:
$$A = \sum_{i=0}^{n-1} a_i 2^i \qquad (1)$$
and
$$B = \sum_{i=0}^{n-1} b_i 2^i \qquad (2)$$
Let $P_i$ represent the partial product computed after the $i$-th bit is fed [3]. $P_i$ is given by:
$$P_i = P_{i-1} + 2^i (a_i B_{i-1} + b_i A_{i-1}) + 2^{2i} a_i b_i \qquad (3)$$
where $A_i$ and $B_i$ represent the values of the operands A and B, respectively, when only the
bits from the Least Significant Bit (LSB) to the $i$-th bit are considered, that is,
$$A_i = A \bmod 2^{i+1} \qquad (4)$$
and
$$B_i = B \bmod 2^{i+1} \qquad (5)$$
with the initial values $P_{-1} = A_{-1} = B_{-1} = 0$.
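The recursion of equation (3) can be checked numerically. The following is a minimal software sketch, not a model of the hardware in [3]; the function name `serial_partial_products` is hypothetical and the operands are assumed to be non-negative n-bit integers fed LSB first.

```python
def serial_partial_products(a: int, b: int, n: int) -> list[int]:
    """Return [P_0, ..., P_(n-1)] computed with recursion (3)."""
    p_prev = 0  # P_(-1)
    a_prev = 0  # A_(-1)
    b_prev = 0  # B_(-1)
    history = []
    for i in range(n):
        a_i = (a >> i) & 1  # bit of A fed at step i
        b_i = (b >> i) & 1  # bit of B fed at step i
        # Equation (3): cross terms weighted by 2^i, square term by 2^(2i).
        p_i = p_prev + (1 << i) * (a_i * b_prev + b_i * a_prev) + (1 << (2 * i)) * a_i * b_i
        # Equations (4) and (5): extend the truncated operands by one bit.
        a_prev += a_i << i
        b_prev += b_i << i
        p_prev = p_i
        history.append(p_i)
    return history


# After the last bit is fed, the partial product equals the full product.
for a in range(16):
    for b in range(16):
        assert serial_partial_products(a, b, 4)[-1] == a * b
```

Each intermediate value $P_i$ equals $A_i B_i$, which is why the last entry is the full product once all n bits have been fed.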
The generation of the partial products, using equation (3), and their assignment to the multiplier cells
are shown in Table 1 below:
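[Table 1. Columns: Cycle, P, Cell 1, Cell 2, Cell 3, Cell 4. Rows: cycles 1 to 8; product bits P0 to P7. Bit-products: a0b0, a0b1+a1b0, a0b2+a2b0, a0b3+a3b0, a1b1, a1b2+a2b1, a1b3+a3b1, a2b2, a2b3+a3b2, a3b3. The cell and cycle placement of the entries was lost in extraction.]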
TABLE 1: Multiplication Scheme of [3].
From the above table, each cell generates two new bit-products every cycle. To cope with this
constraint, the Basic Cell (BC) of the multiplier proposed by [3] is built around a 5 to 3 counter. The
counter accumulates five inputs of the same weight into a sum-bit ($S_{out}$) of the same
weight as the inputs, i.e. $2^0$, a first carry-bit ($C^1_{out}$) of weight $2^1$ and a second carry-bit
($C^2_{out}$) of weight $2^2$. In particular, the sum-bit is calculated through a tree of XOR gates to reduce the propagation
delay within the cell. The BC of the multiplier is shown in Figure 1. The multiplier uses n identical
cells to perform the multiplication of two n-bit numbers in 2n cycles, as can be seen in Figure 2.
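The behaviour of the 5 to 3 counter can be illustrated in software. The sketch below is one possible wiring of two FAs and one HA and is an assumption, since the exact connections of Figure 1(a) are not reproduced here; the function names are hypothetical.

```python
from itertools import product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum, carry) of two bits."""
    return x ^ y, x & y


def counter_5_to_3(x1: int, x2: int, x3: int, x4: int, x5: int) -> tuple[int, int, int]:
    """Reduce five equal-weight bits to (S_out, C1_out, C2_out)."""
    s1, c1 = full_adder(x1, x2, x3)      # first FA: three of the inputs
    s_out, c2 = full_adder(s1, x4, x5)   # second FA: remaining inputs, gives weight-2^0 bit
    c1_out, c2_out = half_adder(c1, c2)  # HA merges the two weight-2^1 carries
    return s_out, c1_out, c2_out


# Exhaustive check: the outputs encode the number of ones among the inputs.
for bits in product((0, 1), repeat=5):
    s_out, c1_out, c2_out = counter_5_to_3(*bits)
    assert sum(bits) == s_out + 2 * c1_out + 4 * c2_out
```

The output triple is the binary count of the inputs, with weights $2^0$, $2^1$ and $2^2$ as described above.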
[Figure 1. Blocks: 5 to 3 counter, FA, FA, HA, Latch. Signals: ai, bi, control, Sin, Cin, Sout, Cout, C1out, C2out.]
FIGURE 1: (a) 5 to 3 Counter Made of Two FAs and One HA. (b) The BC of [3].
[Figure 2. Blocks: BC n, BC 2, BC 1. Signals: ai, bi, control, Si, constant 0 input.]
FIGURE 2: The n-bit Multiplier by [3].
In [4], the multiplication scheme of Table 1 was modified to make more efficient
use of the hardware. In fact, the multiplier proposed by [4] uses only half the number of cells required
by [3]. To achieve this, the partial products generated by the last n/2 cells in [3] were reallocated to
the first n/2 cells and rescheduled at cycle n+1. In this way, full utilisation of cells 1 to n/2 can be
achieved, as shown in Table 2.
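[Table 2. Columns: Cycle, P, Cell 1, Cell 2. Rows: cycles 1 to 8; product bits P0 to P7. Bit-products: a0b0, a0b1+a1b0, a0b2+a2b0, a0b3+a3b0, a1b1, a1b2+a2b1, a1b3+a3b1, a2b2, a2b3+a3b2, a3b3. The cell and cycle placement of the entries was lost in extraction.]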
TABLE 2: Multiplication Scheme by [4].
The multiplier was modified using little extra hardware: n/2 multiplexers and two n/2-bit shift registers, which store the
n/2 most significant bits of the data words A and B. The BC of the modified
multiplier is shown in Figure 3. The multiplier structure as proposed by [4] is shown in Figure 4.
This architecture reduces the number of 5 to 3 counters by 50% and the number
of latches by 33%. The clock path equals the delay of a multiplexer, a gated counter and a latch.
[Figure 3. Blocks: multiplexer (inputs 1 and 0), 5 to 3 counter, Latch. Signals: ai, bi, control, Sin, Cin, Sout, Cout, C1out, C2out.]
FIGURE 3: The BC of [4].
[Figure 4. Blocks: BC n/2, BC 2, BC 1, two n/2 shift registers, two multiplexers (inputs 1 and 0). Signals: ai, bi, control, Si, constant 0 input.]
FIGURE 4: The n-bit Multiplier by [4].
3. THE NEW FULLY SERIAL MULTIPLIERS
Although about 50% of the area used in [3] has been saved by [4], the throughput rate has not
been increased. On the contrary, it has decreased as a multiplexer was added to the structure
making the clock period of the multiplier in [4] equivalent to the delay of a multiplexer, an AND
gate, a 5 to 3 counter and a latch. To remedy this problem, two new structures are proposed.
3.1. Structure I
In order to reduce the clock path as described above, the multiplication algorithm has been modified.
It generates the bit-products associated with cells 2 to n from the (n+1)-th cycle onwards. The scheduling of the
tasks of the first cell is kept unchanged, but the latency of the multiplier is increased to n cycles. The
multiplication scheme is shown in Table 3 for 4-bit operands. The multiplication operation can be
divided into two parts by rewriting the product of the two numbers, A and
B, in the following way:
$$A \times B = a_0 B + \sum_{i=1}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j} \qquad (6)$$
To keep working under the constraint of sending one bit of each operand at a time, the term $a_0 B$
in equation (6) is generated by the first cell of the multiplier during the first n cycles. At this stage,
the other cells only propagate the partial products already generated by the first cell. At the
(n+1)-th cycle, all the operand bits have been fed and the term
$\sum_{i=1}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ can be generated
during the last n cycles. The clock period is equivalent to the delay of a FA, an AND gate and a
latch, as shown in Figure 5. Therefore, Structure I achieves a speed similar to that of
serial-parallel multipliers.
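The two-phase split of equation (6) can be checked numerically. The sketch below is an illustration only and does not model the cells of Figure 5: the function names `structure_one_schedule` and `product_from_schedule` are hypothetical, and the schedule records only which bit-products are formed in which cycle (following Table 3), not which cell forms them.

```python
def structure_one_schedule(n: int) -> list[tuple[int, list[tuple[int, int]]]]:
    """Return (cycle, [(i, j), ...]) pairs; (i, j) stands for the bit-product a_i * b_j."""
    schedule = []
    # Cycles 1..n: the term a_0 * B is built one bit-product per cycle.
    for j in range(n):
        schedule.append((j + 1, [(0, j)]))
    # Cycles n+1..2n: the remaining rows a_i * b_j, i = 1..n-1, one column j per cycle.
    for j in range(n):
        schedule.append((n + 1 + j, [(i, j) for i in range(1, n)]))
    return schedule


def product_from_schedule(a: int, b: int, n: int) -> int:
    """Sum every scheduled bit-product with its weight 2^(i+j), as in equation (6)."""
    total = 0
    for _cycle, terms in structure_one_schedule(n):
        for i, j in terms:
            bit_product = ((a >> i) & 1) & ((b >> j) & 1)
            total += bit_product << (i + j)
    return total


# Every pair of 4-bit operands reproduces the ordinary product.
assert all(product_from_schedule(a, b, 4) == a * b for a in range(16) for b in range(16))
```

For n = 4 the schedule has eight entries, and cycle 5 contains the bit-products a1b0, a2b0 and a3b0, which agrees with the corresponding row of Table 3.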
3.2. Structure II
The 5 to 3 counters have been widely used in the literature [3-6]. Basically, such a counter
reduces 5 bits of the same weight to three bits: a result-bit of the same weight as the inputs (a
weight of $2^0$), a carry-bit, which has a weight of $2^1$, and a far carry-bit that has a weight of $2^2$. It is
clear that while the sum of the inputs is at most 5, the three output bits can encode values up to 7, and as such,
two representations of the outputs (6 and 7) are never used. Therefore, it is more appropriate to
reduce the 5 inputs to 3 outputs of a different form: a result-bit of the same weight as the inputs, and 2 carry-bits of
twice the weight of the result-bit, whose maximum value is exactly 5. For this purpose, a new cell has been developed by using two
FAs as shown in Figure 6. The first FA is used to accumulate two bit-products with a carry feedback.
The second FA is used to generate a result-bit from the result of the first FA, the result-bit from the
adjacent cell and a carry feedback. The two carry-bits generated by the new cell are fed back and
accumulated with the bit-products of the next cycle. The sum-bit of the first FA is registered, which
makes the clock period equivalent to the delay of a multiplexer, an AND gate, a FA and a latch. The
multiplier structure directly implements the algorithm shown in Table 2. The multiplier requires only
n/2 cells for the multiplication of two n-bit numbers. It is also modular and needs near communication
links only.
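The arithmetic of the new cell can be sketched as follows. This is a combinational illustration under stated assumptions: the register on the sum-bit of the first FA and the input multiplexer are ignored, and the names `structure_two_cell`, `carry_fb1` and `carry_fb2` are hypothetical.

```python
from itertools import product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def structure_two_cell(p1: int, p2: int, s_in: int, carry_fb1: int, carry_fb2: int) -> tuple[int, int, int]:
    """Reduce five equal-weight bits to a result-bit and two carries of twice its weight."""
    # First FA: the two bit-products plus one fed-back carry.
    s1, carry1 = full_adder(p1, p2, carry_fb1)
    # Second FA: first FA sum, result-bit of the adjacent cell, other fed-back carry.
    result, carry2 = full_adder(s1, s_in, carry_fb2)
    return result, carry1, carry2


# Exhaustive check: both carries have weight 2, so the outputs cover exactly 0..5.
for bits in product((0, 1), repeat=5):
    result, carry1, carry2 = structure_two_cell(*bits)
    assert sum(bits) == result + 2 * (carry1 + carry2)
```

Because both carries carry the same weight, no half adder is needed to merge them, which is consistent with the shorter clock path stated above.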
TABLE 3: Structure I Multiplication Scheme For 4-bit Operands.
Cycle P Cell 1 Cell 2 Cell 3 Cell 4
1 a0b0
2 a0b1
3 a0b2
4 P0 a0b3
5 P1 a3b0 a2b0 a1b0
6 P2 a3b1 a2b1 a1b1
7 P3 a3b2 a2b2 a1b2
8 P4 a3b3 a2b3 a1b3
9 P5
10 P6
11 P7
[Figure 5. Blocks: four FAs, n Shift Register, multiplexer (inputs 1 and 0), Latch. Signals: ai, bi, control, result, constant 0 input.]
FIGURE 5: Structure I bit-bit Serial Multiplier.
[Figure 6. Blocks: two FAs, multiplexer (inputs 1 and 0). Signals: ai, bi, control, Sin, Sout.]
FIGURE 6: The Basic Cell of Structure II Fully Serial Multiplier
4. AREA-TIME² EFFICIENT BIT-BIT SERIAL MULTIPLIER
In this section, a new multiplier structure that is capable of multiplexing two multiplication
operations into Structure I is proposed. This has the merit of doubling the throughput rate at the
expense of extra hardware consisting of 2n multiplexers and n latches. By optimizing the
multiplier for area-time² efficiency, the problem of lost cycles is circumvented. The lost cycles are
the cycles needed for carry propagation once the generation of the partial products is finished.
The best structure in the literature that can multiplex two multiplication operations into the same
multiplier was described in [7]. The algorithm presented in [7] is an improvement made on the
multiplication scheme of Table 2. Starting from this multiplication scheme, it reassigns and
reschedules the partial products generated by the (n/4+1)th cell and above starting at (n+n/4)th
cycle to the cells from 1 to n/4 at the (n+n/2)th cycle, respectively. This has the effect of freeing
n/4 most significant cells at the (n+n/4)th cycle and thereafter. The multiplication scheme adopted
by [7] to achieve this operation is shown in Table 5.
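[Table 5. Columns: Cycle, P, Cell 1, Cell 2. Rows: cycles 1 to 8, paired with product bits P0 to P7 respectively. Bit-products: a0b0, a0b1+a1b0, a0b2+a2b0, a0b3+a3b0, a1b1, a1b2+a2b1, a1b3+a3b1, a2b2, a2b3+a3b2, a3b3. The cell and cycle placement of the entries was lost in extraction.]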
TABLE 5: Multiplication Scheme of [7].
Although the work presented in [7] can be applied to the multiplier of Structure II, it is easier to
optimize the multiplier of Structure I for area-time² efficiency. One can clearly observe that the
result from the first cell is not accumulated with the bit-products of the other cells until the (n+1)-th
cycle. Therefore, instead of feeding the result of the first cell to its neighbouring cell, it is delayed
by n cycles using n latches before being accumulated with the rest of the partial products, as can
be seen in Figure 7. In this way, the first cell is used for n cycles to generate the bit-products of
the first multiplication operation, then is reinitialised during one cycle before being used to
generate the bit-products of the second multiplication operation for another n cycles,
and so on. The remaining cells operate in almost the same fashion. The key point is that they
generate and accumulate the bit-products of the first pair of operands only when the first cell has
finished producing its bit-products. This operation lasts for n cycles before the propagation of the
partial results is switched to the multiplexers, which allows the cells to generate and accumulate
the bit-products of the second pair of operands. Consequently, two multiplication operations can be
multiplexed into this multiplier every 2n cycles. The proposed architecture is depicted in Figure 7.
5. COMPARISON OF PERFORMANCE
A performance comparison of the newly proposed architectures with similar structures available in
the literature [3,4], in terms of area usage and speed, is presented in Table 6.
In terms of speed, the multiplier of Structure I has a clock period equivalent to the delay of a FA, an
AND gate and a latch, and as such operates at a higher clock rate. Furthermore, the multiplier of
Structure II has a clock period of a multiplexer, an AND gate, a FA and a latch, which makes it faster
than the multiplier described in [3]. In terms of area usage, the improvements introduced in [4] on the
multiplier of [3] have resulted in saving half the total number of cells. In terms of FPGA area usage
and in the case of n-bit operands, Structure I is mapped into 5n/2 slices of a Virtex-4 FPGA and
Structure II uses 2n slices. The multiplier given in [3] is mapped into 5n slices while the multiplier
described in [4] requires 2n slices. These results clearly show the advantages of the new structures
in terms of both speed and area usage.
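The slice counts quoted above scale linearly with the operand width, so they are easy to tabulate for a given n. The helper below is a small arithmetic sketch using only the per-bit figures stated in this section; the names `SLICES_PER_BIT` and `slice_estimate` are hypothetical, and exact fractions are used so that 5n/2 is not rounded for odd n.

```python
from fractions import Fraction

# Virtex-4 slices per operand bit, as quoted in the text of this section.
SLICES_PER_BIT = {
    "Multiplier in [3]": Fraction(5),
    "Multiplier in [4]": Fraction(2),
    "Structure I": Fraction(5, 2),
    "Structure II": Fraction(2),
}


def slice_estimate(n: int) -> dict[str, Fraction]:
    """Return the quoted slice count of each multiplier for n-bit operands."""
    return {name: per_bit * n for name, per_bit in SLICES_PER_BIT.items()}


if __name__ == "__main__":
    for name, slices in slice_estimate(16).items():
        print(f"{name}: {slices} slices")
```

For 16-bit operands this gives 80 slices for the multiplier of [3], 40 for Structure I, and 32 for both the multiplier of [4] and Structure II.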
[Figure 7. Blocks: four FAs, six multiplexers (inputs 1 and 0), n Shift Register. Signals: ai, bi, control.]
FIGURE 7: The New Area-Time² Efficient Multiplier.
| | Multiplier in [3] | Multiplier in [4] | Structure I | Structure II |
|---|---|---|---|---|
| Basic Cell | counter + 2 AND gates + multiplexer + 6 latches | counter + 2 AND gates + 6 latches + multiplexer | FA + AND gate + 4 latches | 2 FAs + 2 AND gates + multiplexer + 6 latches |
| n-bit multiplier area usage | n BCs | n/2 BCs + n latches | n BCs + n latches | n/2 BCs + n latches |
| Longest path | AND gate + counter + multiplexer + latch | AND gate + counter + multiplexer + latch | AND gate + FA + latch | AND gate + FA + multiplexer + latch |
| Latency | 1 cycle | 1 cycle | n cycles | 2 cycles |
| n-bit multiplier area usage in FPGA Virtex-4 | 5n slices | 2n slices | 5n/2 slices | 2n slices |
TABLE 6: Performance Comparison.
| | Multiplier in [7] | New multiplier |
|---|---|---|
| Basic Cell | counter + 2 AND gates + multiplexer + 6 latches | FA + AND gate + 4 latches |
| n-bit multiplier area usage | 3n/4 BCs + 2n latches ≈ 63n gates (100%) | n BCs + 4n latches + 2n multiplexers ≈ 66n gates (104%) |
| Longest path | AND gate + counter + multiplexer + latch | AND gate + FA + latch |
TABLE 8: Performance Comparison of Area-Time² Efficient Structures.
Table 8 shows a comparison of performance in terms of hardware usage and speed between the
new area-time² efficient multiplier and the multiplier described in [7]. An estimate of the area
usage of both structures, based on the number of gates, is also shown. It is assumed that the area
of the 5 to 3 counter is equal to that of two FAs and a Half Adder, as shown in Figure 1a, which
has the same behavior as the 5 to 3 counter. The area usage of the two structures is almost
the same, and the BCs and the longest path have not been changed. It is worth pointing out the
reason behind the choice of the multiplication scheme of Table 4. One may comment that a
"parallel to serial converter" added to a bit serial-parallel multiplier transforms it into a fully serial
multiplier with features identical to those of the Structure I multiplier. Had this approach been
adopted, once the multiplier is optimised for area-time² efficiency, an extra n multiplexers would
have been added to the multiplier. These multiplexers would be placed in the data path,
making its clock path equal to the delay of a multiplexer, a gated FA and a latch, and thus making it slower
than the multiplier derived from Structure I.
6. CONCLUSIONS
In this paper, new structures for reduced pin count multiplication architectures have been presented.
These multipliers are systolic and scalable, thus suitable for VLSI implementation. They are both
modular and need near communication links only. Structure I is the first bit-bit serial multiplier with
a speed similar to that of existing serial-parallel multipliers. In Structure II, the basic cell has
been modified to a more appropriate 5 to 3 counter, thus increasing the throughput rate of the
multiplier. Structure I has been optimised for area-time² efficiency, which has resulted in doubling
the throughput rate.
7. REFERENCES
[1] K.K. Parhi, "VLSI Digital Signal Processing Systems: Design and Implementation", A Wiley-Interscience
Publication, 1999.
[2] A. Aggoun, A. Farwan, M.K. Ibrahim and A.S. Ashur, "Radix-2^n Serial-Serial Multipliers", IEE
Proc. Circuits, Devices and Systems, vol. 151, issue 6, pp. 503-509, Dec. 2004.
[3] P. Ienne and M. Viredaz, "Bit-serial multipliers and squarers", IEEE Trans. Computer, 1994,
43, (12), pp. 1445-1450.
[4] A. Aggoun, A.S. Ashur and M.K. Ibrahim, "Area-Time efficient serial-serial multipliers", IEEE
International Symp. on Circuits and Systems (ISCAS), pp. V-585-588, Geneva, May 2000.
[5] N. Strader and V. Rhyne, "A canonical bit-sequential multiplier", IEEE Trans. Computer, 1982,
vol. 31, pp. 691-626.
[6] J. Scanlon and W. Fuchs, "High-performance bit-serial multiplication", in Proc. IEEE ICCD'86,
Rye Brook, NY, Oct. 1986.
[7] A.S. Ashur, "New Efficient Multiplication Structure and their Applications", Ph.D. thesis, Dept.
of Electrical and Electronic Eng., the University of Nottingham, 1996.

More Related Content

PDF
CFA based SBOX and Modified Mixcolumn Implementation of 8 Bit Datapath for AES
PDF
Multiplier and Accumulator Using Csla
PDF
Eq36876880
PDF
Design and Implementation of High Speed Area Efficient Double Precision Float...
PDF
Implementation of Low-Complexity Redundant Multiplier Architecture for Finite...
DOCX
Graph based transistor network generation method for supergate design
DOCX
High performance pipelined architecture of elliptic curve scalar multiplicati...
PDF
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...
CFA based SBOX and Modified Mixcolumn Implementation of 8 Bit Datapath for AES
Multiplier and Accumulator Using Csla
Eq36876880
Design and Implementation of High Speed Area Efficient Double Precision Float...
Implementation of Low-Complexity Redundant Multiplier Architecture for Finite...
Graph based transistor network generation method for supergate design
High performance pipelined architecture of elliptic curve scalar multiplicati...
DESIGN OF RADIX-8 BOOTH MULTIPLIER USING KOGGESTONE ADDER FOR HIGH SPEED ARIT...

What's hot (20)

PDF
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
DOCX
Graph based transistor network generation method for supergate design
PDF
A High Speed Wallace Tree Multiplier Using Modified Booth Algorithm for Fast ...
PDF
Low Power Implementation of Booth’s Multiplier using Reversible Gates
PDF
Power Optimization using Reversible Gates for Booth’s Multiplier
PDF
Q045079298
DOCX
High performance nb-ldpc decoder with reduction of message exchange
DOCX
A high performance fir filter architecture for fixed and reconfigurable appli...
PDF
A comparative study of different multiplier designs
PDF
D0161926
PDF
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
PDF
Hz3115131516
PDF
Low cost reversible signed comparator
PDF
Iaetsd mac using compressor based multiplier and carry save adder
PDF
International Journal of Engineering Research and Development (IJERD)
PPT
Lec7 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Kar...
PDF
Bu34437441
PDF
FPGA Implementation of Mix and Inverse Mix Column for AES Algorithm
PDF
IRJET- Radix 8 Booth Encoded Interleaved Modular Multiplication
PDF
International Journal of Engineering and Science Invention (IJESI)
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
Graph based transistor network generation method for supergate design
A High Speed Wallace Tree Multiplier Using Modified Booth Algorithm for Fast ...
Low Power Implementation of Booth’s Multiplier using Reversible Gates
Power Optimization using Reversible Gates for Booth’s Multiplier
Q045079298
High performance nb-ldpc decoder with reduction of message exchange
A high performance fir filter architecture for fixed and reconfigurable appli...
A comparative study of different multiplier designs
D0161926
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder
Hz3115131516
Low cost reversible signed comparator
Iaetsd mac using compressor based multiplier and carry save adder
International Journal of Engineering Research and Development (IJERD)
Lec7 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Kar...
Bu34437441
FPGA Implementation of Mix and Inverse Mix Column for AES Algorithm
IRJET- Radix 8 Booth Encoded Interleaved Modular Multiplication
International Journal of Engineering and Science Invention (IJESI)
Ad

Viewers also liked (13)

PPS
Aprendamos tecnologia de almacenamiento y CLOUD COMPUTING
PDF
Tafsir e-rahmatul lil aalameen
PPTX
Producción agrícola
PDF
Confident Gold Coast-bangalore5.com
PDF
Mandating Reduced Transit Fares for Low-Income Residents
DOCX
Diversity state of_infor_beliefs_rubric fall 2011
PPTX
The quopn-advantage
DOCX
Thirumeni.D-HR
PPTX
Przykład zadania Analiza Mnożnikowa
PPTX
Model liniowy Holta
PDF
Apostila curso de blaster
PPTX
Golder - Site Remediation & LSRP Expertise
PPTX
Disney Sustainability
Aprendamos tecnologia de almacenamiento y CLOUD COMPUTING
Tafsir e-rahmatul lil aalameen
Producción agrícola
Confident Gold Coast-bangalore5.com
Mandating Reduced Transit Fares for Low-Income Residents
Diversity state of_infor_beliefs_rubric fall 2011
The quopn-advantage
Thirumeni.D-HR
Przykład zadania Analiza Mnożnikowa
Model liniowy Holta
Apostila curso de blaster
Golder - Site Remediation & LSRP Expertise
Disney Sustainability
Ad

Similar to Area Efficient and Reduced Pin Count Multipliers (20)

PDF
Implementation of Low Power and Area Efficient Carry Select Adder
PDF
Compressor based approximate multiplier architectures for media processing ap...
PDF
F010113644
DOCX
Modified booth
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
Area Delay Power Efficient and Implementation of Modified Square-Root Carry S...
PDF
Designing a Novel High Performance Four-to-Two Compressor Cell Based on CNTFE...
PDF
Design and Verification of Area Efficient Carry Select Adder
DOC
Survey on Prefix adders
PDF
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
PDF
Design and Implementation of Low-Power and Area-Efficient 64 bit CSLA using VHDL
PDF
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
PDF
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
PDF
Iaetsd 128-bit area
PDF
Design and Implementation of Optimized 32-Bit Reversible Arithmetic Logic Unit
PDF
SFQ MULTIPLIER
PDF
Feasible methodology for
PDF
FPGA Implementation of High Speed Architecture of CSLA using D-Latches
Implementation of Low Power and Area Efficient Carry Select Adder
Compressor based approximate multiplier architectures for media processing ap...
F010113644
Modified booth
International Journal of Engineering Research and Development (IJERD)
Area Delay Power Efficient and Implementation of Modified Square-Root Carry S...
Designing a Novel High Performance Four-to-Two Compressor Cell Based on CNTFE...
Design and Verification of Area Efficient Carry Select Adder
Survey on Prefix adders
A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier us...
Design and Implementation of Low-Power and Area-Efficient 64 bit CSLA using VHDL
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
International Journal of Engineering and Science Invention (IJESI)
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...
Comparative Study of Low Power Low Area Bypass Multipliers for Signal Process...
Iaetsd 128-bit area
Design and Implementation of Optimized 32-Bit Reversible Arithmetic Logic Unit
SFQ MULTIPLIER
Feasible methodology for
FPGA Implementation of High Speed Architecture of CSLA using D-Latches

Recently uploaded (20)

PPTX
Introduction to pro and eukaryotes and differences.pptx
PDF
Hazard Identification & Risk Assessment .pdf
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
1_English_Language_Set_2.pdf probationary
PDF
Indian roads congress 037 - 2012 Flexible pavement
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
DOC
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
PDF
Empowerment Technology for Senior High School Guide
PDF
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
PPTX
TNA_Presentation-1-Final(SAVE)) (1).pptx
PDF
Computing-Curriculum for Schools in Ghana
PDF
Trump Administration's workforce development strategy
PDF
David L Page_DCI Research Study Journey_how Methodology can inform one's prac...
PPTX
ELIAS-SEZIURE AND EPilepsy semmioan session.pptx
PPTX
Share_Module_2_Power_conflict_and_negotiation.pptx
PPTX
Introduction to Building Materials
PDF
My India Quiz Book_20210205121199924.pdf
PPTX
Virtual and Augmented Reality in Current Scenario
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf
Introduction to pro and eukaryotes and differences.pptx
Hazard Identification & Risk Assessment .pdf
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
1_English_Language_Set_2.pdf probationary
Indian roads congress 037 - 2012 Flexible pavement
Practical Manual AGRO-233 Principles and Practices of Natural Farming
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
Empowerment Technology for Senior High School Guide
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
TNA_Presentation-1-Final(SAVE)) (1).pptx
Computing-Curriculum for Schools in Ghana
Trump Administration's workforce development strategy
David L Page_DCI Research Study Journey_how Methodology can inform one's prac...
ELIAS-SEZIURE AND EPilepsy semmioan session.pptx
Share_Module_2_Power_conflict_and_negotiation.pptx
Introduction to Building Materials
My India Quiz Book_20210205121199924.pdf
Virtual and Augmented Reality in Current Scenario
202450812 BayCHI UCSC-SV 20250812 v17.pptx
FOISHS ANNUAL IMPLEMENTATION PLAN 2025.pdf

Area Efficient and Reduced Pin Count Multipliers

  • 1. Omar Nibouche International Journal of Engineering (IJE), Volume (7) : Issue (1) : 2013 1 Area Efficient and Reduced Pin Count Multipliers Omar Nibouche o.nibouche@tu.edu.sa College of Computers and Information Technology Taif University POB 888 Taif 21964, KSA Abstract Fully serial multipliers can play an important role in the implementation of DSP algorithms in resource-limited chips such as FPGAs; offering area efficient architectures with a reduced pin count and moderate throughput rates. In this paper two structures that implement the fully serial multiplication operation are presented. One significant aspect of the new designs is that they are systolic and require near communication links only. They are superior in speed and area usage to similar architectures in the literature. The paper also present a new fully serial multiplier optimized for area-time 2 efficiency with better performance than available architectures in the open literature. Keywords: Educed Pin Count, Serial Multiplication, Area-Time 2 . 1. INTRODUCTION There is a crucial advantage offered by bit-serial processors over their parallel counterpart, which lies in the very efficient use of chip area. They are particularly suitable for applications that require slow to moderate speeds and in batch mode applications. By contrast, bit-parallel processors are useful for fast speed systems, but at the expense of larger a area usage and thus they are more expensive [1-2]. Traditional bit serial multiplier structures suffer from an inefficient generation of partial products, which leads to hardware overuse and slow speed systems. In this paper, two structures for bit serial multiplication are presented. The first structure, called structure I, is the first fully serial multiplier reported in the literature with comparable performance - in terms of speed- to existing serial-parallel multipliers. 
The second structure, termed structure II, requires an extra multiplexer in the clock path; thus making it slower, but has the merit of reducing the latency of the multiplier. The remainder of the paper is organised as follows: in section 2, the previous work in the literature is reviewed, while section 3 describes the new structures for fully serial multiplication. Section 4 is concerned with an optimisation of the multiplier of Structure I in terms of area-time 2 efficiency. A comparison of performance is shown in section 5 and conclusions are given in section 6. 2. BIT SERIAL MULTIPLIERS: A REVIEW One of the early bit serial multipliers was proposed by [3]. It generates the partial products in a recursive fashion. Consider the multiplication of two n-bit positive numbers A and B as follows: ∑ −= = = 1 0 2 ni i i iaA (1) and ∑ −= = = 1 0 2 ni i i ibB (2)
  • 2. Omar Nibouche International Journal of Engineering (IJE), Volume (7) : Issue (1) : 2013 2 Let Pi represents the partial product computed after the i th bit is fed [3]. Pi is given by: i 2 111 22 ba)AbB(aPP i i i-ii-i i i-i +++= (3) where Ai and Bi represent the value of the operands A and B, respectively, and by considering only bits from the Least Significant Bit (LSB) to the i th bit, that is, Ai = A mod 2 i+1 (4) and Bi = B mod 2 i+1 (5) with the initial values P-1 = A-1 = B-1 = 0. The generation of the partial product, using equation (3), and their assignment to the multiplier cells is shown in Table 1 below: Cycle P Cell 1 Cell 2 1 2 3 4 5 6 7 8 a3 b3 Cell 3 Cell 4 a2 b3 +a3 b2 P0 a2b2 a1 b1 a0 b0 a1b2+a2b1 a0 b1 +a1 b0 a0 b2 +a2 b0 a1 b3 +a3 b1 a0 b3 +a3 b3 P1 P2 P3 P4 P5 P6 P7 TABLE 1: Multiplication Scheme of [3]. From the above table, each cell generates two new bit-products every cycle. To cope with this constraint, The Basic Cell (BC) of the multiplier proposed by [3] is built around a 5 to 3 counter. The counter is capable of accumulating five inputs of the same weight to a sum-bit (Sout) of the same weight as the inputs, i.e.2 0 , 1 st carry-bit (C 1 out) of a weight of 2 1 and 2 nd carry-bit (C 2 out) of a weight of 2 2 . In particular, the sum-bit is calculated through a tree of EXOR gate to reduce the propagation delay within the cell. The BC of the multiplier is shown in Figure 1. The multiplier uses n identical cells to perform the multiplication of two n-bit numbers in 2n cycles, as it can be seen in Figure 2.
  • 3. Omar Nibouche International Journal of Engineering (IJE), Volume (7) : Issue (1) : 2013 3 SoutC1 out C2 out Cin Sin ai bi control 5to3 counter Sout Cout Latch FA FA HA 22 21 20 20 2020 20 20 (b) (a) FIGURE 1 (a): 5 to 3 Counter Made of Two FAs and One HA (b). The BC of [3]. BC n BC 2 BC 1 ai Si 0 control bi FIGURE 2: The n-bit Multiplier by [3]. In [4], modifications were carried out on the multiplication scheme of Table 1 to make a more efficient use of the hardware. In fact, the multiplier proposed by [4] uses only half the number of cells required by [3]. To achieve this, the partial products generated by the last n/2 cells in [3] were reallocated to the first n/2 cells and rescheduled at cycle n+1. In this way, a full utilisation of cells 1 to n/2 can be achieved, as it can be shown in Table 2. Cycle P Cell 1 Cell 2 1 2 3 4 5 6 7 8 a3 b3 a2b3+a3b2 P0 a2 b2 a1 b1 a0b0 a1b2+a2b1 a0b1+a1b0 a0b2+a2b0 a1 b3 +a3 b1 a0 b3 +a3 b3 P1 P2 P3 P4 P5 P6 P7 TABLE 2: Multiplication Scheme by [4].
The multiplier was modified using little extra hardware: two n/2-bit shift registers, which store the n/2 most significant bits of the data words A and B, and n/2 multiplexers. The BC of the modified multiplier is shown in Figure 3. The multiplier structure proposed by [4] is shown in Figure 4. This architecture reduces the number of 5 to 3 counters by 50% and the number of latches by 33%. The clock path equals the delay of a multiplexer, a gated counter and a latch.

FIGURE 3: The BC of [4].

FIGURE 4: The n-bit Multiplier by [4].

3. THE NEW FULLY SERIAL MULTIPLIERS

Although about 50% of the area used in [3] has been saved by [4], the throughput rate has not been increased. On the contrary, it has decreased, because the added multiplexer makes the clock period of the multiplier in [4] equivalent to the delay of a multiplexer, an AND gate, a 5 to 3 counter and a latch. To remedy this problem, two new structures are proposed.

3.1. Structure I

In order to reduce the clock path described above, the multiplication algorithm has been modified. It generates the bit-products associated with cells 2 to n from the (n+1)-th cycle onwards. The scheduling of the tasks of the first cell is kept unchanged, but the latency of the multiplier is increased to n cycles. The multiplication scheme is shown in Table 3 for 4-bit operands. The multiplication operation can be divided into two parts by rewriting the product of the two numbers, A and B, in the following way:

$$A \times B = a_0 B + \sum_{i=1}^{n-1} \sum_{j=0}^{n-1} a_i b_j\, 2^{i+j} \tag{6}$$
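As a quick numerical check of the two-part decomposition in equation (6), the Python sketch below computes both terms separately. It is an arithmetic illustration only; `split_product` and `bits` are hypothetical helper names.

```python
def bits(x: int, n: int) -> list[int]:
    """Return the n least significant bits of x, LSB first."""
    return [(x >> k) & 1 for k in range(n)]


def split_product(a: int, b: int, n: int) -> tuple[int, int]:
    """Return the two terms of equation (6): a_0 * B and the double sum."""
    a_bits, b_bits = bits(a, n), bits(b, n)
    first_term = a_bits[0] * b
    second_term = sum(
        (a_bits[i] * b_bits[j]) << (i + j)
        for i in range(1, n)
        for j in range(n)
    )
    return first_term, second_term


def decomposition_holds(n: int = 4) -> bool:
    """Check that the two terms add up to A * B for all n-bit operands."""
    return all(
        sum(split_product(a, b, n)) == a * b
        for a in range(1 << n)
        for b in range(1 << n)
    )
```

For A = 13 and B = 11 with n = 4, the first term is 11 (since a0 = 1) and the second term is 132, which together give 143.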
To keep working under the constraint of sending one bit of each operand at a time, the term $a_0 B$ in equation (6) is generated by the first cell of the multiplier during the first n cycles. At this stage, the other cells only propagate the partial products already generated by the first cell. At the (n+1)-th cycle, all the operand bits have been fed and the term $\sum_{i=1}^{n-1} \sum_{j=0}^{n-1} a_i b_j\, 2^{i+j}$ can be generated during the last n cycles. The clock period is equivalent to the delay of a FA, an AND gate and a latch, as shown in Figure 5. Therefore, Structure I achieves a speed comparable to that of serial-parallel multipliers.

3.2. Structure II

The 5 to 3 counters have been widely used in the literature [3-6]. Basically, such a counter reduces 5 bits of the same weight to three bits: a result-bit of the same weight as the inputs (a weight of $2^0$), a carry-bit of weight $2^1$, and a far carry-bit of weight $2^2$. While the sum of the inputs is at most 5, the three outputs can encode values up to 7, so two output representations (6 and 7) are never used. It is therefore more appropriate to reduce the 5 inputs to 3 outputs of a different form: a result-bit of the same weight as the inputs, and 2 carry-bits of twice the weight of the result-bit, which together cover exactly the range 0 to 5. For this purpose, a new cell has been developed using two FAs, as shown in Figure 6. The first FA accumulates two bit-products with a carry feedback. The second FA generates a result-bit from the result of the first FA, the result-bit from the adjacent cell and a carry feedback. The two carry-bits generated by the new cell are fed back and accumulated with the bit-products of the next cycle. The sum-bit of the first FA is registered, making the clock period equivalent to the delay of a multiplexer, an AND gate, a FA and a latch.
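The arithmetic of the two-FA cell described above can be sketched in Python as follows. This is a purely combinational simplification: the register on the sum-bit of the first FA, the multiplexer and the AND gates are left out, and the argument names are hypothetical. It only illustrates that a result-bit plus two carry-bits of twice its weight are enough to represent the sum of the five inputs.

```python
from itertools import product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def structure2_cell(p0: int, p1: int, carry_fb1: int,
                    s_in: int, carry_fb2: int) -> tuple[int, int, int]:
    """Return (result_bit, carry1, carry2); both carries have twice the result weight."""
    s1, carry1 = full_adder(p0, p1, carry_fb1)       # first FA: two bit-products + carry feedback
    s_out, carry2 = full_adder(s1, s_in, carry_fb2)  # second FA: FA1 sum + neighbour result + carry feedback
    return s_out, carry1, carry2


def cell_preserves_sum() -> bool:
    """Exhaustively check result + 2*(carry1 + carry2) against the sum of the inputs."""
    for inputs in product((0, 1), repeat=5):
        s_out, carry1, carry2 = structure2_cell(*inputs)
        if s_out + 2 * (carry1 + carry2) != sum(inputs):
            return False
    return True
```

The largest input sum, 5, maps to all three outputs being set (1 + 2 + 2), so no output combination is wasted.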
The multiplier structure directly implements the algorithm shown in Table 2. The multiplier requires only n/2 cells for the multiplication of two n-bit numbers. It is also modular and needs near communication links only.

TABLE 3: Structure I Multiplication Scheme For 4-bit Operands.

Cycle | P | Cell 1 | Cell 2 | Cell 3 | Cell 4
1 | - | a0b0 | - | - | -
2 | - | a0b1 | - | - | -
3 | - | a0b2 | - | - | -
4 | P0 | a0b3 | - | - | -
5 | P1 | - | a3b0 | a2b0 | a1b0
6 | P2 | - | a3b1 | a2b1 | a1b1
7 | P3 | - | a3b2 | a2b2 | a1b2
8 | P4 | - | a3b3 | a2b3 | a1b3
9 | P5 | - | - | - | -
10 | P6 | - | - | - | -
11 | P7 | - | - | - | -
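The generation schedule of Table 3 can be written down for a general n as in the Python sketch below. The mapping of cell k (for k from 2 to n) to the operand bit with index n+1-k is an assumption read off the 4-bit table, and the sketch covers only which bit-product each cell generates in each cycle, not the accumulation or carry propagation of the hardware. The function names are hypothetical.

```python
def structure1_schedule(n: int) -> dict[int, list]:
    """Map each generation cycle (1..2n) to a per-cell list of (i, j) index pairs.

    An entry (i, j) means the cell generates the bit-product a_i * b_j;
    None means the cell generates nothing in that cycle.
    """
    schedule = {}
    for cycle in range(1, 2 * n + 1):
        row = [None] * n
        if cycle <= n:
            row[0] = (0, cycle - 1)  # Cell 1 generates a_0 * b_j during the first n cycles
        else:
            j = cycle - n - 1        # b_j is reused from cycle n+1 onwards
            for cell in range(2, n + 1):
                row[cell - 1] = (n + 1 - cell, j)
        schedule[cycle] = row
    return schedule


def product_from_schedule(a: int, b: int, n: int) -> int:
    """Sum all scheduled bit-products with their weights 2^(i+j)."""
    total = 0
    for row in structure1_schedule(n).values():
        for entry in row:
            if entry is not None:
                i, j = entry
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return total
```

For n = 4, cycle 5 of the sketch gives a3b0, a2b0 and a1b0 for cells 2 to 4, matching the corresponding row of Table 3, and every bit-product of equation (6) is generated exactly once.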
FIGURE 5: Structure I Bit-Bit Serial Multiplier.

FIGURE 6: The Basic Cell of Structure II Fully Serial Multiplier.

4. AREA-TIME² EFFICIENT BIT-BIT SERIAL MULTIPLIER

In this section, a new multiplier structure capable of multiplexing two multiplication operations into Structure I is proposed. This has the merit of doubling the throughput rate at the expense of extra hardware consisting of 2n multiplexers and n latches. By optimizing the multiplier for area-time² efficiency, the problem of lost cycles is circumvented. The lost cycles are the cycles needed for carry propagation once the generation of the partial products is finished. The best structure in the literature that can multiplex two multiplication operations into the same multiplier was described in [7]. The algorithm presented in [7] is an improvement on the multiplication scheme of Table 2. Starting from this multiplication scheme, it reassigns and reschedules the partial products generated by the (n/4+1)-th cell and above, starting at the (n+n/4)-th cycle, to cells 1 to n/4 at the (n+n/2)-th cycle, respectively. This has the effect of freeing the n/4 most significant cells at the (n+n/4)-th cycle and thereafter. The multiplication scheme adopted by [7] to achieve this operation is shown in Table 5.
TABLE 5: Multiplication Scheme of [7]. Columns: Cycle (1 to 8), P (P0 to P7), Cell 1, Cell 2. Bit-product entries: a0b0, a1b1, a2b2, a3b3, a0b1+a1b0, a1b2+a2b1, a2b3+a3b2, a0b2+a2b0, a1b3+a3b1, a0b3+a3b0.

Although the work presented in [7] can be applied to the multiplier of Structure II, it is easier to optimize the multiplier of Structure I for area-time² efficiency. One can observe that the result from the first cell is not accumulated with the bit-products of the other cells until the (n+1)-th cycle. Therefore, instead of feeding the result of the first cell to its neighbouring cell, it is delayed by n cycles using n latches before being accumulated with the rest of the partial products, as can be seen in Figure 7. In this way, the first cell is used for n cycles to generate the bit-products of the first multiplication operation, then is reinitialised during one cycle before being used to generate the bit-products of the second multiplication operation for another n cycles, and so on. The remaining cells operate in almost the same fashion. The key point is that they generate and accumulate the bit-products of the first pair of operands only when the first cell has finished producing its bit-products. This operation lasts for n cycles before the propagation of the partial results is switched to the multiplexers, which allows the cells to generate and accumulate the bit-products of the second pair of operands. Consequently, two multiplication operations can be multiplexed into this multiplier every 2n cycles. The proposed architecture is depicted in Figure 7.

5. COMPARISON OF PERFORMANCE

A performance comparison of the proposed architectures with similar structures available in the literature [3,4], in terms of area usage and speed, is presented in Table 6.
In terms of speed, the multiplier of Structure I has a clock period equivalent to the delay of a FA, an AND gate and a latch, and as such operates at faster speeds. Furthermore, the multiplier of Structure II has a clock period equivalent to the delay of a multiplexer, an AND gate, a FA and a latch, which makes it faster than the multiplier described in [3]. In terms of area usage, the improvements introduced in [4] on the multiplier of [3] have resulted in saving half the total number of cells. In terms of FPGA area usage and in the case of n-bit operands, Structure I is mapped into 5n/2 slices of a Virtex-4 FPGA and Structure II uses 2n slices. The multiplier given in [3] is mapped into 5n slices, while the multiplier described in [4] requires 2n slices. These results show the advantages of the new structures in terms of both speed and area usage.
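To make the slice counts quoted above concrete for a given word length, the small Python sketch below tabulates them. The per-bit figures are taken from the text; the dictionary keys and the function name `virtex4_slices` are hypothetical, and the result is only the stated linear estimate, not a synthesis result.

```python
from fractions import Fraction

# Slices per operand bit on a Virtex-4 FPGA, as quoted in the text.
SLICES_PER_BIT = {
    "Multiplier in [3]": Fraction(5),
    "Multiplier in [4]": Fraction(2),
    "Structure I": Fraction(5, 2),
    "Structure II": Fraction(2),
}


def virtex4_slices(design: str, n: int) -> Fraction:
    """Return the quoted slice estimate for an n-bit multiplier of the given design."""
    return SLICES_PER_BIT[design] * n


def slice_table(n: int) -> dict[str, Fraction]:
    """Return the slice estimate of every design for n-bit operands."""
    return {design: virtex4_slices(design, n) for design in SLICES_PER_BIT}
```

For 16-bit operands, for example, the quoted figures give 80 slices for the multiplier in [3], 40 for Structure I, and 32 for both Structure II and the multiplier in [4].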
FIGURE 7: The New Area-Time² Efficient Multiplier.

TABLE 6: Performance Comparison.

 | Multiplier in [3] | Multiplier in [4] | Structure I | Structure II
Basic Cell | counter + 2 AND gates + multiplexer + 6 latches | counter + 2 AND gates + 6 latches + multiplexer | FA + AND gate + 4 latches | 2 FAs + 2 AND gates + multiplexer + 6 latches
n-bit multiplier area usage | n BCs | n/2 BCs + n latches | n BCs + n latches | n/2 BCs + n latches
Longest path | AND gate + counter + multiplexer + latch | AND gate + counter + multiplexer + latch | AND gate + FA + latch | AND gate + FA + multiplexer + latch
Latency | 1 cycle | 1 cycle | n cycles | 2 cycles
n-bit multiplier area usage in FPGA Virtex-4 | 5n slices | 2n slices | 5n/2 slices | 2n slices

TABLE 8: Performance Comparison of Area-Time² Efficient Structures.

 | Multiplier in [7] | New multiplier
Basic Cell | counter + 2 AND gates + multiplexer + 6 latches | FA + AND gate + 4 latches
n-bit multiplier area usage | 3n/4 BCs + 2n latches ≈ 63n gates (100%) | n BCs + 4n latches + 2n multiplexers ≈ 66n gates (104%)
Longest path | AND gate + counter + multiplexer + latch | AND gate + FA + latch

Table 8 compares the new area-time² efficient multiplier with the multiplier described in [7] in terms of hardware usage and speed. An estimate of the area usage of both structures, based on the number of gates, is also shown. It is assumed that the area of the 5 to 3 counter is equal to that of two FAs and a Half Adder (HA) arranged as shown in Figure 1(a), which has the same behavior as the 5 to 3 counter. The area usage of the two structures is nearly the same, while the BCs and the longest paths have not been changed. It is worth pointing out the reason behind the choice of the multiplication scheme of Table 4.
One may comment that a "parallel to serial converter" added to a bit serial-parallel multiplier transforms it into a fully serial multiplier with features identical to those of the Structure I multiplier. Had this approach been adopted, an extra n multiplexers would have been added to the multiplier once it is optimised for area-time² efficiency. These multiplexers would sit in the data path, making the clock path equal to the delay of a multiplexer, a gated FA and a latch, and thus making the multiplier slower than the one derived from Structure I.

6. CONCLUSIONS

In this paper, new structures for reduced pin count multiplication architectures have been presented. These multipliers are systolic and scalable, and thus suitable for VLSI implementation. They are both modular and need near communication links only. Structure I is the first bit-bit serial multiplier with a speed performance similar to that of existing serial-parallel multipliers. In Structure II, the basic cell has been modified to a more appropriate 5 to 3 counter, thus increasing the throughput rate of the multiplier. Structure I has been optimised for area-time² efficiency, which has resulted in doubling the throughput rate.

7. REFERENCES

[1] K.K. Parhi, "VLSI Digital Signal Processing Systems: Design and Implementation", A Wiley-Interscience Publication, 1999.
[2] A. Aggoun, A. Farwan, M.K. Ibrahim and A.S. Ashur, "Radix-2^n Serial-Serial Multipliers", IEE Proc. Circuits, Devices and Systems, vol. 151, issue 6, pp. 503-509, Dec. 2004.
[3] P. Ienne and M. Viredaz, "Bit-serial multipliers and squarers", IEEE Trans. Computers, vol. 43, no. 12, pp. 1445-1450, 1994.
[4] A. Aggoun, A.S. Ashur and M.K. Ibrahim, "Area-Time efficient serial-serial multipliers", IEEE International Symp. on Circuits and Systems (ISCAS), pp. V-585-588, Geneva, May 2000.
[5] N. Strader and V. Rhyne, "A canonical bit-sequential multiplier", IEEE Trans. Computers, vol. 31, pp. 691-626, 1982.
[6] J. Scanlon and W. Fuchs, "High-performance bit-serial multiplication", in Proc. IEEE ICCD'86, Rye Brook, NY, Oct. 1986.
[7] A.S. Ashur, "New Efficient Multiplication Structure and their Applications", Ph.D. thesis, Dept. of Electrical and Electronic Eng., the University of Nottingham, 1996.