ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 2, April 2012

Tripartite Modular Multiplication using
Toom-Cook Multiplication
Amar Mandal, Rupali Syal

Amar Mandal, Department of Computer Science and Engineering, PEC University of Technology, Chandigarh, India.
Rupali Syal, Department of Information Technology, PEC University of Technology, Chandigarh, India.

Abstract— Modular multiplication is the fundamental operation in most public-key cryptosystems. Therefore, the efficiency of modular multiplication directly affects the efficiency of the whole cryptosystem. This paper presents an efficient modular multiplication algorithm for large integers. The proposed algorithm integrates three existing algorithms: the Barrett and Montgomery algorithms for modular multiplication and the Toom-Cook algorithm for multiplication. The algorithm is executed in parallel, which enhances its performance. The algorithms are analyzed with respect to their performance and compared with other modular multiplication algorithms.

Index Terms— Barrett algorithm, Bipartite modular multiplication, Karatsuba multiplication algorithm, Montgomery algorithm, Toom-Cook multiplication, Tripartite modular multiplication.

I. INTRODUCTION

Public Key Cryptography (PKC) was introduced by Diffie and Hellman [1] in the mid-1970s. Many cryptographic protocols, such as the RSA scheme [2], ElGamal [3], Diffie-Hellman key exchange, and DSA [4], are based on modular arithmetic operations.

The efficiency of a particular cryptosystem depends on a number of factors, such as parameter size, time-memory tradeoffs, available processing power, parallel computing, software and/or hardware optimization, and mathematical algorithms. A basic operation in public-key cryptosystems is the modular multiplication of large numbers. An efficient implementation of this operation is the key to high performance.

This paper deals with different modular multiplication algorithms, namely the Barrett algorithm [11], the Montgomery algorithm [9], the Bipartite algorithm [5], the Tripartite algorithm [19] and the proposed algorithm. The Barrett and Montgomery algorithms are widely used today. The Barrett algorithm outputs $(X \cdot Y) \bmod M$ and requires a precomputed value. The Montgomery modular multiplication algorithm outputs $(X \cdot Y)R^{-1} \bmod M$ and also requires a precomputed value. Bipartite modular multiplication integrates the Barrett and Montgomery methods and executes them in parallel. Tripartite modular multiplication uses Karatsuba multiplication to multiply two large numbers and the two efficient Barrett and Montgomery algorithms to compute the modular multiplication; these methods are also executed in parallel.

The proposed modular multiplication algorithm efficiently integrates three existing algorithms: Barrett modular multiplication, Montgomery modular multiplication and Toom-Cook multiplication. It is divided into two steps, a multiplication step and a modular multiplication step. The multiplication step uses the Toom-Cook algorithm. The modular multiplication step uses the Barrett and Montgomery algorithms in parallel. The proposed algorithm minimizes the number of single-precision multiplications and enables more than three-way parallel computation.

The remainder of this paper is structured as follows. Section 2 describes the Barrett algorithm, the Montgomery algorithm, the Bipartite algorithm and the Tripartite algorithm. In Section 3, our proposed algorithm is introduced. Software implementation results are presented in Section 4. Section 5 concludes the paper.

II. RELATED WORK

The algorithms for modular multiplication are described for use with large nonnegative integers expressed in radix $b$ notation, where $b$ can be any integer $\geq 2$. Given a modulus $M$ and two elements $X, Y \in Z_M$, where $Z_M$ is the ring of integers modulo $M$, we define the ordinary modular multiplication as

$$XY \bmod M$$

The inputs of the modular multiplication algorithms are the radix-$b$ representations of $X$, $Y$ and $M$:

$$M = \sum_{i=0}^{k-1} m_i b^i, \qquad 0 < m_{k-1} < b \text{ and } 0 \le m_i < b \text{ for } i = 0, 1, \ldots, k-1$$

$$X = \sum_{i=0}^{k-1} x_i b^i, \qquad 0 < x_{k-1} < b \text{ and } 0 \le x_i < b \text{ for } i = 0, 1, \ldots, k-1$$

$$Y = \sum_{i=0}^{k-1} y_i b^i, \qquad 0 < y_{k-1} < b \text{ and } 0 \le y_i < b \text{ for } i = 0, 1, \ldots, k-1$$

We describe these algorithms for performing the modular multiplication and analyze their time and space requirements. The analysis is performed by counting the total number of multiplications, additions, subtractions, and memory read and write operations in terms of the input size parameter $k$. They are counted to calculate the proportion of the memory access time in the total running time of the modular multiplication algorithm. In our analysis, loop establishment and index computations are not taken into account. The space analysis is performed by counting the total number of words used as temporary space. However, the space required to keep the input and output values is not taken into account.
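The radix-b operand representation and the operation counting used in this section can be illustrated with a short sketch. The helper names `to_digits`, `from_digits` and `schoolbook_multiply` are hypothetical and not part of the paper's implementation; the sketch only assumes digits stored least significant first and counts single-precision multiplications for a plain k × k-digit product.

```python
def to_digits(value, b, k):
    """Return the k radix-b digits of value, least significant digit first."""
    if value < 0 or b < 2:
        raise ValueError("value must be nonnegative and b must be >= 2")
    digits = []
    for _ in range(k):
        value, digit = divmod(value, b)
        digits.append(digit)
    if value != 0:
        raise ValueError("value does not fit in k radix-b digits")
    return digits


def from_digits(digits, b):
    """Evaluate sum(digits[i] * b**i), the inverse of to_digits."""
    return sum(d * b**i for i, d in enumerate(digits))


def schoolbook_multiply(x_digits, y_digits, b):
    """Multiply two digit vectors; return (product digits, multiplication count)."""
    result = [0] * (len(x_digits) + len(y_digits))
    mults = 0
    for i, xi in enumerate(x_digits):
        carry = 0
        for j, yj in enumerate(y_digits):
            # One single-precision multiplication per digit pair.
            carry, result[i + j] = divmod(result[i + j] + xi * yj + carry, b)
            mults += 1
        result[i + len(y_digits)] = carry
    return result, mults
```

For two k-digit operands the counter returns k², which is the baseline that the faster multiplication methods discussed later try to reduce.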

All Rights Reserved © 2012 IJARCSEE


A. BARRETT MODULAR MULTIPLICATION

P. Barrett [11] introduced the idea of estimating the quotient $\lfloor S/M \rfloor$, $S = XY$, with operations that either are less expensive in time than a multiprecision division by $M$ (viz., 2 divisions by a power of $b$ and a partial multiprecision multiplication), or can be done as a precalculation for a given $M$ (viz., $U = \lfloor b^{2k}/M \rfloor$, i.e., $U$ is a scaled estimate of the modulus' reciprocal). The estimate $\hat{q}$ of $\lfloor S/M \rfloor$ is obtained by replacing the floating point divisions in

$$q = \left\lfloor \frac{(S / b^{2k-t})\,(b^{2k} / M)}{b^{t}} \right\rfloor$$

by integer divisions:

$$\hat{q} = \left\lfloor \frac{\lfloor S / b^{2k-t} \rfloor \, \lfloor b^{2k} / M \rfloor}{b^{t}} \right\rfloor$$

This estimate will never be too large and, if $k < t \le 2k$, the error is at most two: $\lfloor S/M \rfloor - 2 \le \hat{q} \le \lfloor S/M \rfloor$. The best choice for $t$, resulting in the least single precision multiplications and the smallest maximal error, is $k+1$, which also was Barrett's original choice. An estimate $\hat{r}$ for $S \bmod M$ is then given by $\hat{r} = S - \hat{q}M$, or, as $\hat{r} < b^{k+1}$ (if $b > 2$), by $\hat{r} = ((S \bmod b^{k+1}) - (\hat{q}M \bmod b^{k+1})) \bmod b^{k+1}$, which means that once again only a partial multiprecision multiplication is needed. At most two further subtractions of $M$ are required to obtain the correct remainder.

BARRETT MODULAR MULTIPLICATION ALGORITHM
Input: X = (x[k-1], x[k-2], ..., x[1], x[0])_b,
       Y = (y[k-1], y[k-2], ..., y[1], y[0])_b,
       M = (m[k-1], m[k-2], ..., m[1], m[0])_b,
       U = (u[k-1], u[k-2], ..., u[1], u[0])_b, b ≥ 2
Output: XY mod M
1. S = XY;
2. q = ((S div b^(k-1)) U) div b^(k+1);
3. S = S mod b^(k+1) − (qM) mod b^(k+1);
4. if (S < 0) then
5.    S = S + b^(k+1);
6. while (S ≥ M) do
7.    S = S − M;

This algorithm requires $3k^2$ multiplications, $6k^2+k+1$ additions, $9k^2+2k+2$ reads and $3k^2+4k$ writes, and uses $2k+1$ words of memory space.

B. MONTGOMERY MODULAR MULTIPLICATION

Let $R > M$ be an integer relatively prime to $M$ such that computations modulo $R$ are easy to process: $R = b^k$. Notice that the condition $\gcd(M, b) = 1$ means that this method cannot be used for all moduli. In case $b$ is a power of 2, it simply means that $M$ should be odd. The m-residue with respect to $R$ of an integer $X < M$ is defined as $XR \bmod M$. The set $\{XR \bmod M \mid 0 \le X < M\}$ clearly forms a complete residue system. The Montgomery reduction of $X$ is defined as $XR^{-1} \bmod M$, where $R^{-1}$ is the inverse of $R$ modulo $M$, and it is the inverse operation of the m-residue transformation. It can be shown that the multiplication of two m-residues followed by Montgomery reduction is isomorphic to the ordinary modular multiplication. The rationale behind the m-residue transformation is the ability to perform a Montgomery reduction $(XR^{-1}) \bmod M$ for $0 \le X < RM$ in almost the same time as a multiplication. The algorithm requires one precomputed value, $M' = -M^{-1} \bmod R$.

MONTGOMERY MODULAR MULTIPLICATION ALGORITHM
Input: X = (x[k-1], x[k-2], ..., x[1], x[0])_b,
       Y = (y[k-1], y[k-2], ..., y[1], y[0])_b,
       M = (m[k-1], m[k-2], ..., m[1], m[0])_b,
       M' = (m'[k-1], m'[k-2], ..., m'[1], m'[0])_b, b ≥ 2
Output: XYR^(-1) mod M
1. S = XY;
2. for (i = 0; i < k; i++) do {
3.    t_i = (S_i M'_0) mod b;
4.    S = S + t_i M b^i;
5. }
6. S = S div b^k;
7. if (S ≥ M) then
8.    S = S − M;

This algorithm (with the slight improvement above) requires $2k^2+k$ multiplications, $4k^2+4k+2$ additions, $6k^2+7k+2$ reads, and $2k^2+5k+1$ writes, including the final multi-precision subtraction, and uses $k+3$ words of memory space.

The Montgomery representation of an integer $X$, denoted by $X_{Mont}$, can be computed by performing a Montgomery multiplication on $X$ and $R^2$, denoted by $\mathrm{Mont}_M(X, R^2)$, resulting in $X_{Mont} = \mathrm{Mont}_M(X, R^2) = (X \cdot R^2 \cdot R^{-1}) \bmod M = (X \cdot R) \bmod M$. After computing the Montgomery multiplication of two operands in Montgomery representation, the result is also in Montgomery representation and can be converted back by multiplication with $R^{-1}$, which comes down to Montgomery multiplication with 1. Computation of the result:

$$T = \mathrm{Mont}_M(X_{Mont}, Y) = (X \cdot R \cdot Y \cdot R^{-1}) \bmod M = (X \cdot Y) \bmod M$$

This means that two Montgomery multiplications are needed for one modular multiplication. That is why the use of Montgomery multiplication is only interesting when many consecutive modular multiplications need to be performed.

C. BIPARTITE MODULAR MULTIPLICATION

Bipartite modular multiplication uses both the Barrett and the Montgomery algorithm. The multiplier is divided into two parts: the upper part is processed with the Barrett algorithm and the lower part with the Montgomery algorithm. The bipartite algorithm was introduced for the purpose of a two-way parallel computation [6]. It uses two custom modular multipliers, a Barrett modular multiplier and a Montgomery multiplier, in order to improve the speed. By combining a Barrett modular multiplication with a Montgomery modular multiplication, it splits the operand multiplier into two parts and processes them in parallel, increasing the calculation speed. In our implementation this method is executed in parallel with the help of the fork() system call in the Linux operating system. The calculation is performed using Montgomery residues defined by a modulus $M$ and a Montgomery radix $R$.
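The two building blocks that the bipartite method combines, Barrett and Montgomery modular multiplication from Sections II.A and II.B, can be sketched as follows. This is a minimal illustration on Python integers rather than the paper's ANSI C implementation: the function names are hypothetical, the operands are assumed to satisfy x, y < M with M having exactly k radix-b digits and b > 2, and the Montgomery variant additionally assumes gcd(M, b) = 1. The call `pow(a, -1, n)` for a modular inverse assumes Python 3.8 or later.

```python
def barrett_modmul(x, y, m, b, k):
    """Return (x * y) mod m using the Barrett quotient estimate with t = k + 1."""
    u = b**(2 * k) // m                  # precomputed U = floor(b^(2k) / M)
    bk1 = b**(k + 1)
    s = x * y
    q = ((s // b**(k - 1)) * u) // bk1   # estimate of floor(S / M), never too large
    s = s % bk1 - (q * m) % bk1          # partial products modulo b^(k+1)
    if s < 0:
        s += bk1
    while s >= m:                        # at most two further subtractions
        s -= m
    return s


def montgomery_modmul(x, y, m, b, k):
    """Return x * y * R^-1 mod m with R = b^k, processing one digit per step."""
    m0_prime = (-pow(m % b, -1, b)) % b  # M'_0 = -M^-1 mod b (lowest digit of M')
    s = x * y
    for i in range(k):
        s_i = (s // b**i) % b            # i-th radix-b digit of the current S
        t_i = (s_i * m0_prime) % b
        s += t_i * m * b**i              # clears digit i of S
    s //= b**k                           # exact division: the low k digits are zero
    if s >= m:
        s -= m
    return s
```

Converting an operand to Montgomery representation with `montgomery_modmul(x, R * R % m, m, b, k)` and multiplying the result by `y` with a second call yields (x · y) mod M, which mirrors the two-multiplication cost noted in Section II.B.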


In the bipartite method the Montgomery radix satisfies $R < M$. Next, we outline the main idea of the bipartite algorithm. Let $R = b^l$ for some $0 < l < k$. Consider the multiplier $Y$ to be split into two parts $Y_1$ and $Y_0$ so that $Y = Y_1R + Y_0$. Then, the Montgomery multiplication modulo $M$ of the integers $X$ and $Y$ can be computed as follows:

$$XYR^{-1} \bmod M = X(Y_1R + Y_0)R^{-1} \bmod M$$
$$= ((XY_1 \bmod M) + (XY_0R^{-1} \bmod M)) \bmod M$$
$$= (\mathrm{Barrett}_M(X, Y_1) + \mathrm{Mont}_M(X, Y_0)) \bmod M$$

Let $l = k/2$. Then this algorithm (with the slight improvement above) requires $\frac{5}{2}k^2$ multiplications, $5k^2 + \frac{5}{2}k + 1$ additions, $\frac{15}{2}k^2 + \frac{19}{2}k + 5$ reads, $\frac{5}{2}k^2 + \frac{17}{2}k + 3$ writes and $5k + 3$ subtractions, and uses $2k + 1$ words of memory space. Because the Montgomery and Barrett methods are executed in parallel, the speed is enhanced.

D. TRIPARTITE MODULAR MULTIPLICATION

The tripartite modular multiplication algorithm is divided into two steps, multiplication and modular multiplication. The multiplication step uses the Karatsuba algorithm and splits the product into three parts; the modular multiplication step processes these three parts with Barrett and Montgomery modular multiplication in parallel.

The multiplication step is computed by the Karatsuba algorithm, a fast multiplication algorithm. It reduces the multiplication of two $k$-digit numbers to at most $3k^{\log_2 3} \approx 3k^{1.585}$ single-digit multiplications in general (and exactly $k^{\log_2 3}$ when $k$ is a power of 2). It is therefore faster than the classical algorithm, which requires $k^2$ single-digit products.

The basic step of Karatsuba's algorithm is a formula that allows us to compute the product of two large numbers $X$ and $Y$ using three multiplications of smaller numbers, each with about half as many digits as $X$ or $Y$, plus some additions and digit shifts. Let $X$ and $Y$ be represented as $k$-digit strings in some base $B$. For any positive integer $l$ less than $k$, one can split the two given numbers as follows:

$$R = B^l, \qquad X = X_1R + X_0, \qquad Y = Y_1R + Y_0$$

where $X_0$ and $Y_0$ are less than $R$. The product is then

$$XY = (X_1R + X_0)(Y_1R + Y_0) = Z_2R^2 + Z_1R + Z_0$$

where

$$Z_2 = X_1Y_1$$
$$Z_0 = X_0Y_0$$
$$Z_1 = X_1Y_0 + X_0Y_1 = (X_1 + X_0)(Y_1 + Y_0) - Z_2 - Z_0$$

Karatsuba observed that $XY$ can thus be computed with only three multiplications and a few extra additions.

The first step multiplies the numbers and splits the product into the three parts $Z_0$, $Z_1$, $Z_2$. The modular multiplication step combines these three parts as follows:

$$(XYR^{-1}) \bmod M = (Z_2R^2 + Z_1R + Z_0)R^{-1} \bmod M$$
$$= ((Z_2R \bmod M) + (Z_1 \bmod M) + (Z_0R^{-1} \bmod M)) \bmod M$$
$$= ((Z_2R \bmod M) + (Z_1 \bmod M) + (X_0Y_0R^{-1} \bmod M)) \bmod M$$
$$= (\mathrm{Barrett}_M(Z_2, R) + \mathrm{Barrett}_M(Z_1, 1) + \mathrm{Mont}_M(X_0, Y_0)) \bmod M$$

To obtain a high-speed implementation, one can compute these three different terms in parallel. We take $l = k/2$ for the calculation. The modular multiplication step requires two Barrett multiplications and one Montgomery multiplication. This step requires $\frac{11}{4}k^2 + k$ multiplications, $\frac{11}{2}k^2 + \frac{21}{2}k + 8$ additions, $\frac{31}{4}k^2 + 16k + 10$ reads, $\frac{3}{2}k^2 + \frac{25}{2}k + 6$ writes and $7k + 5$ subtractions, and the first (multiplication) step requires $\frac{3}{4}k^2$ multiplications, $k^2 + 2k$ additions, $\frac{9}{4}k^2 + 4k$ reads, $\frac{3}{2}k^2 + \frac{7}{2}k - 4$ writes and $k$ subtractions. Compared with the algorithms above, the tripartite algorithm needs slightly more read, write, multiplication, subtraction and addition operations, but it computes the first and second terms with the Barrett algorithm and the third term with the Montgomery algorithm in parallel, so its modular multiplication step takes less time than the other algorithms.

III. THE PROPOSED MODULAR MULTIPLICATION

The proposed modular multiplication algorithm is divided into two steps, multiplication and modular multiplication. The multiplication step uses the Toom-Cook algorithm and splits the product into five parts; the modular multiplication step processes these five parts with Barrett and Montgomery modular multiplication in parallel.

The multiplication step is computed by the Toom-Cook multiplication algorithm. Given two large integers $X$ and $Y$, Toom–Cook splits up $X$ and $Y$ into $t$ smaller parts, each of length $l$, and performs operations on the parts. Toom-3 is only a single instance of the Toom–Cook algorithm, where $t = 3$. Toom-3 reduces 9 multiplications to 5, and runs in $\Theta(n^{\log 5/\log 3})$, about $\Theta(n^{1.465})$. In general, Toom-$t$ runs in $\Theta(c(t)\,n^e)$, where $e = \log(2t - 1)/\log(t)$, $n^e$ is the time spent on sub-multiplications, and $c$ is the time spent on additions and multiplication by small constants. The Karatsuba algorithm is a special case of Toom–Cook, where the number is split into two smaller ones. It reduces 4 multiplications to 3 and so operates at $\Theta(n^{\log 3/\log 2})$, which is about $\Theta(n^{1.585})$. Ordinary long multiplication is equivalent to Toom-1, with complexity $\Theta(n^2)$.

In a typical large integer implementation, each integer is represented as a sequence of digits in positional notation, with the base or radix set to some (typically large) value $b$ (in a computer implementation, $b$ would typically be a power of 2). Say the two integers being multiplied are $m$ and $n$ (the operands $X$ and $Y$ above). Select the base $B = b^i$ such that the number of digits of both $m$ and $n$ in base $B$ is at most $t$ (e.g., 3 in Toom-3). Then separate $m$ and $n$ into their base-$B$ digits $m_i$, $n_i$, and use these digits as coefficients of degree-$(t-1)$ polynomials $p$ and $q$, with the property that $p(B) = m$ and $q(B) = n$:

$$p(x) = m_2x^2 + m_1x + m_0$$
$$q(x) = n_2x^2 + n_1x + n_0$$

The purpose of defining these polynomials is that if we can compute their product $r(x) = p(x)q(x)$, our answer will be $r(B) = m \times n$. In the case where the numbers being multiplied are of different sizes, it is useful to use different values of $t$ for $m$ and $n$, which we call $t_m$ and $t_n$.

Toom-3 evaluates $p$ and $q$ at the five points $0$, $1$, $-1$, $-2$ and $\infty$. The number of elementary operations (additions and subtractions) needed for this evaluation can be reduced. For the first operand (polynomial $p$) of the running example, the sequence is the following:

$$p_0 = m_0 + m_2$$


$$p(0) = m_0$$
$$p(1) = p_0 + m_1$$
$$p(-1) = p_0 - m_1$$
$$p(-2) = (p(-1) + m_2) \times 2 - m_0$$
$$p(\infty) = m_2$$

This sequence requires five addition/subtraction operations, one less than the straightforward evaluation. In practical implementations, as the operands become smaller, the algorithm will switch to schoolbook long multiplication. Letting $r$ be the product polynomial, its values at the five points are obtained by pointwise multiplication:

$$r(0) = p(0)q(0)$$
$$r(1) = p(1)q(1)$$
$$r(-1) = p(-1)q(-1)$$
$$r(-2) = p(-2)q(-2)$$
$$r(\infty) = p(\infty)q(\infty)$$

A difficult design challenge in Toom–Cook is to find an efficient sequence of operations to recover the coefficients of $r$ from these values; one sequence given by Bodrato [14] for Toom-3 is the following, executed in order, with later assignments overwriting earlier values:

$$r_0 \leftarrow r(0)$$
$$r_4 \leftarrow r(\infty)$$
$$r_3 \leftarrow (r(-2) - r(1))/3$$
$$r_1 \leftarrow (r(1) - r(-1))/2$$
$$r_2 \leftarrow r(-1) - r(0)$$
$$r_3 \leftarrow (r_2 - r_3)/2 + 2r(\infty)$$
$$r_2 \leftarrow r_2 + r_1 - r(\infty)$$
$$r_1 \leftarrow r_1 - r_3$$

We now know the product polynomial $r$:

$$r(x) = r_0 + r_1x + r_2x^2 + r_3x^3 + r_4x^4$$

Finally, evaluate $r(B)$ to obtain the final answer. This is straightforward since $B$ is a power of $b$, so the multiplications by powers of $B$ are all shifts by a whole number of digits in base $b$:

$$r(B) = r_0 + r_1B + r_2B^2 + r_3B^3 + r_4B^4$$

The first step thus multiplies the numbers and splits the product into the five parts $r_0, \ldots, r_4$. The modular multiplication step combines these parts as follows. Let $B = R$:

$$r(R) = r_0 + r_1R + r_2R^2 + r_3R^3 + r_4R^4$$

$$r(R)R^{-2} = (r_0 + r_1R + r_2R^2 + r_3R^3 + r_4R^4)R^{-2} = r_0R^{-2} + r_1R^{-1} + r_2 + r_3R + r_4R^2$$

Reducing both sides modulo $M$ gives:

$$(r(R)R^{-2}) \bmod M = (r_0R^{-2} + r_1R^{-1} + r_2 + r_3R + r_4R^2) \bmod M$$
$$= (\mathrm{Mont}_M(\mathrm{Mont}_M(r_0, 1), 1) + \mathrm{Mont}_M(r_1, 1) + \mathrm{Barrett}_M(r_2, 1) + \mathrm{Barrett}_M(r_3, R) + \mathrm{Barrett}_M(r_4, R^2)) \bmod M$$

To obtain a high-speed implementation, one can compute these five different terms in parallel. Here $B = b^i$, and for the calculation of this algorithm we take $i = k/3$. The second (modular multiplication) step, which consists of three Barrett and three Montgomery multiplications, requires $\frac{29}{18}k^2 + 2k$ multiplications, $\frac{29}{3}k^2 + 11k + 12$ additions, $\frac{17}{6}k^2 + \frac{34}{3}k + 16$ reads, $\frac{17}{18}k^2 + \frac{25}{3}k + 10$ writes and $\frac{11}{3}k + 8$ subtractions. The algorithms above differ only slightly in their numbers of read, write, multiplication, subtraction and addition operations; because the proposed algorithm is executed in parallel, it takes less time than the other algorithms.

IV. RESULT

Software performance: execution times for the modular multiplication of a 2k-digit number modulo a k-digit modulus M for the five modular multiplication algorithms, compared to the execution time of a k × k-digit multiplication (b = 23, on a 1.73 GHz Intel Celeron(R)-based PC with gcc 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)).

RESULT: execution times in milliseconds

Length of M (bits)   Barrett   Montgomery   Bipartite   Tripartite (Karatsuba)   Proposed algorithm
128                  8         9            22          30                       47
256                  42        50           11          50                       52
512                  61        59           20          58                       58
1024                 94        105          89          60                       74
2048                 173       143          153         120                      91

These observations are confirmed by a software implementation of these algorithms; see the table. The implementation is written in ANSI C [4] and hence should be portable to any computer for which an implementation of the ANSI C standard exists. All figures in this article were obtained on a 1.73 GHz Intel Celeron(R)-based PC using the 32-bit compiler gcc 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4). Parallel execution of the Bipartite, Tripartite and Proposed algorithms is achieved with the help of the fork() system call in the Linux operating system.

V. CONCLUSIONS

This paper discussed various algorithms for modular multiplication of large numbers and evaluated them with respect to their accuracy, computation performance and efficiency. Each algorithm has its own features suitable for a specific field of application. No single algorithm provides a perfect solution to meet all demands; depending on the environment in which the computations are to be performed, one algorithm may be preferable over another. The Barrett and Montgomery algorithms are efficient for smaller moduli, but for larger moduli the tripartite and proposed algorithms are more efficient, as shown in the results.

Future work would be to use the Schönhage–Strassen algorithm for the multiplication step instead of the Toom-Cook method in the proposed algorithm.
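As a supplementary illustration, the tripartite decomposition of Section II.D and the five-term decomposition of Section III can be checked numerically with the sketch below. It is not the paper's parallel ANSI C implementation: the function names are hypothetical, the terms are computed sequentially, and ordinary modular arithmetic (including `pow(R, -1, M)`, which assumes Python 3.8 or later) stands in for the Barrett and Montgomery multipliers. The sketch assumes gcd(R, M) = 1 and operands smaller than R³ for the Toom-3 split.

```python
def toom3_coefficients(m, n, B):
    """Return [r0, r1, r2, r3, r4] with m * n == sum(r_i * B**i), for m, n < B**3."""
    m0, m1, m2 = m % B, (m // B) % B, m // B**2
    n0, n1, n2 = n % B, (n // B) % B, n // B**2

    def evaluate(c0, c1, c2):
        # Values at the points 0, 1, -1, -2 and infinity (five add/sub steps).
        t = c0 + c2
        at_1 = t + c1
        at_m1 = t - c1
        at_m2 = (at_m1 + c2) * 2 - c0
        return c0, at_1, at_m1, at_m2, c2

    p = evaluate(m0, m1, m2)
    q = evaluate(n0, n1, n2)
    r_0, r_1, r_m1, r_m2, r_inf = (a * c for a, c in zip(p, q))

    # Interpolation sequence attributed to Bodrato; every division is exact.
    r0 = r_0
    r4 = r_inf
    r3 = (r_m2 - r_1) // 3
    r1 = (r_1 - r_m1) // 2
    r2 = r_m1 - r_0
    r3 = (r2 - r3) // 2 + 2 * r_inf
    r2 = r2 + r1 - r_inf
    r1 = r1 - r3
    return [r0, r1, r2, r3, r4]


def tripartite_modmul(x, y, M, R):
    """Return x * y * R^-1 mod M from the three Karatsuba parts Z2, Z1, Z0."""
    x1, x0 = divmod(x, R)
    y1, y0 = divmod(y, R)
    z2 = x1 * y1
    z0 = x0 * y0
    z1 = (x1 + x0) * (y1 + y0) - z2 - z0
    r_inv = pow(R, -1, M)
    # The three terms are independent and could be computed in parallel.
    return (z2 * R % M + z1 % M + z0 * r_inv % M) % M


def five_term_modmul(x, y, M, R):
    """Return x * y * R^-2 mod M from the five Toom-3 parts r0..r4."""
    r0, r1, r2, r3, r4 = toom3_coefficients(x, y, R)
    r_inv = pow(R, -1, M)
    terms = [
        r0 * r_inv * r_inv % M,  # role of Mont_M(Mont_M(r0, 1), 1)
        r1 * r_inv % M,          # role of Mont_M(r1, 1)
        r2 % M,                  # role of Barrett_M(r2, 1)
        r3 * R % M,              # role of Barrett_M(r3, R)
        r4 * R * R % M,          # role of Barrett_M(r4, R^2)
    ]
    return sum(terms) % M
```

Because each entry of `terms` depends only on one coefficient, the five reductions are independent of one another, which is the property the proposed algorithm relies on for parallel execution.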

REFERENCES
[1] W. Diffie and M.E. Hellman, “New Directions in Cryptography,”
IEEE Trans. Information Theory, vol. IT-22, no. 6, pp. 644-654,
Nov. 1976
[2] R.L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining
Digital Signatures and Public-Key Cryptosystems,” Comm.
ACM, vol. 21, no. 2, pp. 120-126, Feb. 1978.
[3] T. ElGamal, “A Public Key Cryptosystem and a Signature Scheme
Based on Discrete Logarithms,” IEEE Trans. Information
Theory, vol. 31, no. 4, pp. 469-472, July 1985.
[4] ANSI X9.30, Public Key Cryptography for the Financial Services
Industry: Part 1: The Digital Signature Algorithm (DSA), Am.
Nat’l Standards Inst., Am. Bankers Assoc., 1997.
[5] Marcelo E. Kaihara and Naofumi Takagi, “Bipartite Modular
Multiplication Method” IEEE Transactions on Computers, vol.
57, no. 2, pp. 157-164, Feb. 2008
[6] M. E. Kaihara and N. Takagi. Bipartite Modular Multiplication. In J.
R. Rao and B. Sunar, editors, Proceedings of 7th International
Workshop on Cryptographic Hardware and Embedded Systems
(CHES), number 3659 in Lecture Notes in Computer Science.
Springer-Verlag, 2005
[7] G.R. Blakley, “A Computer Algorithm for Calculating the Product AB
Modulo M,” IEEE Trans. Computers, vol. 32, no. 5, pp. 497-500,
May 1983.
[8] E.F. Brickell, “A Fast Modular Multiplication Algorithm with
Application to Two Key Cryptography,” Advances in Cryptology
Proc. CRYPTO ’82, pp. 51-60, 1983.
[9] P.L. Montgomery, “Modular Multiplication without Trial Division,”
Math. Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[10] K.R. Sloan, “Comments on a Computer Algorithm for Calculating the
Product AB Modulo M,” IEEE Trans. Computers, vol. 34, no. 3,
pp. 290-292, Mar. 1985.
[11] P.D. Barrett, “Implementing the Rivest Shamir and Adleman public
key encryption algorithm on a standard digital signal processor,”
Advances in Cryptology, Proc. Crypto’86, LNCS 263, A.M.
Odlyzko, Ed., Springer-Verlag, pp. 311–323, 1987.
[12] Bosselaers, R. Govaerts, and J. Vandewalle, "Comparison of Three
Modular Reduction Functions," Proc. CRYPTO'93, pp.175-186.
[13] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, "Handbook of
Applied Cryptography," chapter 14.3.3, pp. 603-604.
[14] Marco Bodrato. Towards Optimal Toom–Cook Multiplication for
Univariate and Multivariate Polynomials in Characteristic 2 and
0. In WAIFI'07 proceedings, volume 4547 of LNCS, pages
116–133. June 21–22, 2007
[15] A. Toom. The Complexity of a Scheme of Functional Elements
Realizing the Multiplication of Integers. Translations of Dokl.
Akad. Nauk. SSSR, 3, 1963.
[16] A. Karatsuba and Y. Ofman. Multiplication of Many-Digital Numbers
by Automatic Computers. Translation in Physics-Doklady,
145:595-596, 7 1963.
[17] N. Koblitz. Elliptic Curve Cryptosystem. Math. Comp., 48:203-209,
1987.
[18] Ç.K. Koç, T. Acar, and BS Kaliski, “Analyzing and Comparing
Montgomery Multiplication Algorithms,” IEEE Micro, vol. 16,
no. 3, pp. 26-33, June 1996.
[19] Kazuo Sakiyama, Miroslav Knezevic, Junfeng Fan, Bart Preneel, and
Ingrid Verbauwhede. Tripartite modular multiplication.
Integration, 44(4):259-269, 2011.
[20] Craig Gentry, Shai Halevi, and Vinod Vaikuntanathan. i-Hop
Homomorphic Encryption and Rerandomizable Yao Circuits. In Tal
Rabin, editor, CRYPTO, volume 6223 of Lecture Notes in
Computer Science, pages 155-172. Springer, 2010.
