SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6304
Performance Analysis of RSA Algorithm with CUDA Parallel Computing
Rishikesh Kadam1, Vishakha Vidhani2
1,2 Student, Department Of Computer Engineering, V.E.S.I.T., Mumbai University, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - In today’s data driven and digitally connected
world, computer system security is of paramount
importance. The computer security relies on cryptographic
security techniques for protection of data. RSA is one of the
most used cryptographic algorithms which can be sped up
using parallel computing techniques. The advances in
computing capability of Graphics Processing Unit as a co-
processor of the system CPU allows parallelization of RSA
encryption and decryption. This paper focuses on parallel
implementation of RSA using Nvidia’s Compute Unified
Device Architecture (CUDA). First, the RSA cryptographic
algorithm is studied. Then, it is designed for implementation
in CUDA framework. In the last step, the RSA function is
called parallely by all nodes in a Hadoop cluster. For this
step, JCUDA is used. The results are analyzed mainly for the
speed up introduced by GPU as compared to the CPU-only
implementation.
Key Words: Cryptography, RSA, CUDA, GPU, Hadoop.
1. INTRODUCTION
RSA algorithm is named after its inventors Ron Rivest, Adi
Shamir and Leonard Adleman. It is a public key
cryptographic algorithm whose security relies on the
difficulty of finding prime factors of large numbers.[1] But
the modular power used in this algorithm acts as a
bottleneck for its performance which makes its large scale
implementations inefficient. Thus, the optimization of RSA
has been a research focus. The advancements in multi-
core processing and parallel computing is huge boost for
RSA optimization.
In last decade, the use of Graphical Processing Unit in
training of machine learning models and neural networks
has been a major trend, but along with that, GPU turns out
to be exclusively beneficial for general purpose parallel
computing [2] after the advent of Compute Unified Device
Architecture (CUDA) technology. Nvidia’s CUDA platform
makes thousands of stream processors available to
provide parallel speed up. This paper explores the
implementation of CUDA for parallelization of RSA
algorithm and presents the performance analysis. First,
the RSA algorithm is encapsulated in Java Native Interface
(JNI). Then, JCUDA is used to implement the
parallelization.
In section 2 of the paper, traditional principle of RSA
algorithm is studied. Section 3 gives an architectural
overview of the system hardware used for the
implementation. Section 4 presents the design and
implementation of RSA parallelization and section 5 shows
the results of performance analysis.
2. RSA ALGORITHM
RSA algorithm is the most commonly used asymmetric
cryptography algorithm. It works on two different keys i.e.
public key and private key. The public key is known to
everyone and is used for encrypting messages. Encrypted
Messages can only be decrypted using the private key.
The keys for the RSA algorithm are generated in the
following way:
1. Choose two distinct large prime numbers, p and q.
2. Calculate n = pq, n is used as the modulus for both the
public and private keys.
3. Compute φ(pq) = (p − 1)(q − 1). (φ is totient function).
4. Pick an integer e such that 1 < e < φ(pq), and e and
φ(pq) are coprime (e and φ(pq) share no divisors other
than 1). The public key consists of e (often called public
exponent) and the modulus n.
5. Find d which satisfies the relation d*e = 1(mod φ(pq)).
The private key consists of d (private exponent) and the
modulus n.
The message M is encrypted using formula C ≡ Me mod n,
where C is the encrypted message. And it is decrypted
using the formula M ≡ Cd mod n.
Example :
1. The plaintext M to be encrypted: 1314
2. Let p=397 and q=401
3. n= pq = 159,197
4. φ(pq) = 396 * 400 = 158,400
5. The public key(e,n) is (343, 159197) and private key d
is 12,007 (d*e = 1 mod φ(pq))
6. C ≡ Me mod n, C ≡ 1314343 (mod 159,197) = 33,677
7. M ≡ Cd mod n, M ≡ 33,67712,007 (mod 159,197) = 1314
3. CUDA ARCHITECTURE
NVIDIA has designed a special C-based language CUDA to
utilize the massively parallel nature of GPU.[3] Parallel
execution is expressed by the kernel function that is
executed on a set of threads in parallel on GPU; GPU is also
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6305
called the device. This kernel code is a C code for only one
thread. The number of thread blocks, and the number of
threads within those blocks that execute this kernel in
parallel are given explicitly when this function is called.
NVIDIA’s CUDA architecture includes a number of
multiprocessors, each multiprocessor consists of:
● 8 scalar processor cores
● 2 special function units for transcendentals
● 1 multithreaded instruction unit
● On-chip shared memory
Fig -1: CUDA Hardware
Each multiprocessor has SIMD architecture (single
instruction multiple data). Each processor of the
multiprocessor executes a different thread but all the
threads run the same instruction but operate on different
data.
Several threads (up to 512) may execute concurrently
within a multiprocessor and communicate through a small
shared memory bank (16KB) which can be seen in Fig -1.
Shared memory is local to each multiprocessor unlike
device memory and allows local synchronization to be
more efficient. It is divided into many parts. Each thread
block within multiprocessor accesses its own part of
shared memory. This part of shared memory is not
accessible by any other thread block of this multiprocessor
or of some other multiprocessor.
4. PARALLELIZATION
This paper presents a two-level parallelization of RSA
algorithm, node level and thread level as shown in Fig -2.
Fig -2: Two Level Parallelization
CUDA framework is used for parallel implementation on
thread level. RSA algorithm makes packets with the same
length by dividing the plaintext or ciphertext. The same
function will be executed for all the packets. As mentioned
in Section 2, the text is considered as integers. All the
packets are considered as arrays with same number of
integer elements. The encryption and decryption
processes are carried out on these packets. We can get the
thread and block index and then, they are used to assign
packets (elements) to each thread. Thus, on thread level,
CUDA multi-thread programming improves the
performance of RSA algorithm. [4]
The large size of plaintext or ciphertext can be
troublesome to handle on a single computer. We used a
distributed file system to store the text to be processed.
Each node (computer) in the cluster executes the parallel
RSA algorithm function. This is node level or computer
level parallelization. In such clusters, each node gets a
small part of data. In this case, a part of text is processed
by each node. Then, each node performs thread level
multi-processing.
This paper uses Hadoop Distributed File System (HDFS)
[5] for distribution of the text to all the nodes in the
cluster. HDFS is block-structured file system where each
file is divided into multiple blocks. The blocks are stored
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6306
across the cluster of nodes. It is a Java based software
framework. Google’s MapReduce [6] and Google File
System (GFS) [7] papers built the foundation for HDFS. As
Hadoop is based on pure Java, we need to find a method to
implement our C language code of RSA on Hadoop. JCUDA
[7] provides an interface for invoking CUDA kernels using
Java code or framework such as Hadoop. The syntax and
usage of JCUDA are beyond the scope of this paper. A
sample of kernel code is as follows:
__global__ static void RSACUDA(int* arr)
{
long p,q,e,d,m,n,t,c,i;
int x = blockDimension.x * blockId.x + threadId.x;
p=191;
q=223;
n=p*q;
t=(p-1)*(q-1);
e=t-1;
d=1;
while(((e*d)%t)!=1)
d++;
for(i=0;i<50;i++)
{
c = encrypt(*(arr+i+x*50),e,n);
m = decrypt(c,d,n);
*(arr+i+x*50) = m;
}
}
In next section, performance of parallel implementation is
discussed.
5. VERIFICATION
For verification, we divide the tests into two parts. First,
we run the RSA program using a CPU-only processing.
Then, in the next step, we implement the RSA algorithm on
CUDA framework.
5.1 TEST ENVIRONMENT
The computers used for the testing have following
configuration:
● CPU: Intel Core i5-5200U 2.20GHz
● GPU: Nvidia GeForce 920M (2048MB Memory)
● Memory: 2.20GHz DDR3 (8192MB)
For development and testing, Visual Studio 2017 is used.
For GPU processing, CUDA Toolkit 10.1 is used. Using the
Hadoop framework, the text was distributed on a cluster
of 3 nodes. JCUDA was involved to invoke CUDA kernels
from Java code on each HDFS node.
5.2 RESULTS
Table -1: Time taken for execution (in milliseconds) for
different text sizes.
No. of
chars
CPU
10
Threads
100
Threads
500
Threads
10000 10176.45 12023.84 1256.73 276.49
20000 20312.82 24096.33 2303.34 530.18
30000 31034.47 36201.93 3678.01 760.48
40000 40583.05 47936.17 4907.38 1015.74
50000 51396.11 61034.28 6179.57 1346.05
60000 61564.26 71124.05 7389.18 1592.38
70000 71891.49 83456.61 8599.06 1904.75
80000 82392.73 96345.34 9803.49 2267.11
Table -1 displays the relationship between the size of
input text and the time taken for execution (in
milliseconds) in traditional CPU-only mode and GPU
implemented multithreading mode.
Chart -1: Time taken for execution by CPU-only and GPU
multithreading mode.
Chart 1 represents this output graphically. The
relationship between size of input text and execution time
is linear. As we can see, the execution time is very less
when we implement RSA algorithm in 500-thread mode. It
is almost 40 to 50 times faster than the CPU-only mode.
This output proves that higher the degree of parallelism,
better is the performance.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6307
Chart -2: Elements processed per millisecond in CPU-only
and multithreading mode
If number of threads are less, about 10, RSA algorithm
takes longer takes longer than CPU-only mode. This is the
result of co-operation processing and communication
overhead between CPU and GPU. It majorly consists of
copying of data from the host computer (CPU) to the
device (GPU). This can be verified in Chart -2.
6. CONCLUSIONS
This paper presented the CUDA based implementation of
RSA algorithm. We described the traditional RSA
cryptosystem. The large size of text and modular power
function act as the bottleneck for performance which is a
drawback for large scale implementation.
We used Nvidia’s Compute Unified Device Architecture
(CUDA) for parallel implementation of RSA. First, RSA was
executed in CPU-only mode. Then, it was implemented in
CUDA framework in 10-thread, 100-thread and 500-
thread modes. The results were compared which made it
clear that higher the degree parallelism, better us the
performance. The 10-thread mode gave a poor
performance as compared to CPU-only mode. It was a
result of communication and data transfer overhead
between CPU and GPU.
Apache Hadoop framework can be used for distributed
storage of large text files. All the nodes in the cluster store
a part of the text file and execute RSA in parallel on the
data.
REFERENCES
[1] R.L. Rivest, A. Shamir, L. Adleman, “A Method for
Obtaining Digital Signatures and Public-Key
Cryptosystems,” Communications of the ACM, vol. 21,
iss. 2, pp. 120-126, February 1978.
[2] John D. Owens, David Luebke, Naga Govindaraju,
Mark Harris, Jens Krüger, Aaron E. Lefohn, Tim
Purcell, “A survey of general-purpose computation on
graphics hardware,” Computer Graphics Forum, vol.
26, pp. 80-113, March 2007.
[3] NVIDIA Corporation, NVIDIA Compute Unified Device
Architecture Programming Guide, Version 10.1,
February 2019.
[4] Daniel Page, Nigel P. Smart, “Parallel cryptographic
arithmetic using a redundant montgomery
representation,” IEEE Transactions on Computers, vol.
53, no. 11, pp. 1474-1482, November 2004.
[5] Konstantin Shvachko, Hairong Kuang, Sanjay Radia,
Robert Chansler, “The Hadoop Distributed File
System,” Proceedings of the 2010 IEEE 26th
Symposium on Mass Storage Systems and
Technologies, pp. 1-10, May 2010.
[6] Jeffrey Dean, Sanjay Ghemawat, “MapReduce:
Simplified Data Processing on Large Clusters,” Sixth
Symposium on Operating System Design and
Implementation, San Francisco, pp. 137-150,
December 2004.
[7] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung,
“The Google File System,” Proceedings of the 19th
ACM Symposium on Operating Systems Principles, NY,
pp. 20-43, October 2003.
[8] Yonghong Yan, Max Grossman, Vivek Sarkar, “JCUDA:
A Programmer-Friendly Interface for Accelerating
Java Programs with CUDA,” Proceedings of the 15th
International Euro-Par Conference on Parallel
Processing, pp. 887–899, August 2009.

More Related Content

PDF
Efficient algorithm for rsa text encryption using cuda c
PDF
PERFORMANCE ANALYSIS OF SHA-2 AND SHA-3 FINALISTS
PPTX
Simple regenerating codes: Network Coding for Cloud Storage
PDF
Jf3416691674
PDF
SSBSE10.ppt
DOC
Criptography approach using magnets
PDF
Network Coding for Distributed Storage Systems(Group Meeting Talk)
PDF
GPU Computing with CUDA
Efficient algorithm for rsa text encryption using cuda c
PERFORMANCE ANALYSIS OF SHA-2 AND SHA-3 FINALISTS
Simple regenerating codes: Network Coding for Cloud Storage
Jf3416691674
SSBSE10.ppt
Criptography approach using magnets
Network Coding for Distributed Storage Systems(Group Meeting Talk)
GPU Computing with CUDA

What's hot (19)

PDF
Analysis of rsa algorithm using gpu
PDF
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
PDF
Efficient Broadcast Authentication with Highest Life Span in Wireless Sensor ...
PPTX
Lec08 optimizations
PDF
LITTLE DRAGON TWO: AN EFFICIENT MULTIVARIATE PUBLIC KEY CRYPTOSYSTEM
PDF
Implementation of RSA Algorithm with Chinese Remainder Theorem for Modulus N ...
PDF
A New Key Agreement Protocol Using BDP and CSP in Non Commutative Groups
PDF
A Novel Method for Preventing Selective Jamming Attacks in Wireless Networks
PDF
Full Communication in a Wireless Sensor Network by Merging Blocks of a Key Pr...
PDF
Performance evaluation of ecc in single and multi( eliptic curve)
PDF
EXPERIMENTS ON DIFFERENT RECURRENT NEURAL NETWORKS FOR ENGLISH-HINDI MACHINE ...
PDF
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
PDF
Solution(1)
PDF
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
PDF
Secure multipath routing scheme using key
PDF
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
PDF
Zenoh Tutorial
PDF
Geo distributed parallelization pacts in map reduce
PPTX
Lec07 threading hw
Analysis of rsa algorithm using gpu
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
Efficient Broadcast Authentication with Highest Life Span in Wireless Sensor ...
Lec08 optimizations
LITTLE DRAGON TWO: AN EFFICIENT MULTIVARIATE PUBLIC KEY CRYPTOSYSTEM
Implementation of RSA Algorithm with Chinese Remainder Theorem for Modulus N ...
A New Key Agreement Protocol Using BDP and CSP in Non Commutative Groups
A Novel Method for Preventing Selective Jamming Attacks in Wireless Networks
Full Communication in a Wireless Sensor Network by Merging Blocks of a Key Pr...
Performance evaluation of ecc in single and multi( eliptic curve)
EXPERIMENTS ON DIFFERENT RECURRENT NEURAL NETWORKS FOR ENGLISH-HINDI MACHINE ...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
Solution(1)
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
Secure multipath routing scheme using key
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
Zenoh Tutorial
Geo distributed parallelization pacts in map reduce
Lec07 threading hw
Ad

Similar to IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing (20)

PDF
Efficient algorithm for rsa text encryption using cuda c
PDF
Harvard poster
PDF
Nvidia cuda tutorial_no_nda_apr08
PDF
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
PDF
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
PDF
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
PPTX
GPU-Accelerated Parallel Computing
PPT
cuda.ppt
PDF
Cuda Without a Phd - A practical guick start
PPT
Lecture2 cuda spring 2010
PPT
Monte Carlo on GPUs
PPT
Introduction to parallel computing using CUDA
PPT
PDF
Cuda introduction
PDF
GPU: Understanding CUDA
PDF
Feasibility of Security in Micro-Controllers
PDF
Introduction to CUDA
PPT
3. CUDA_PPT.ppt info abt threads in cuda
PPTX
Brief Overview of a Parallel Nbody Code
PDF
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
Efficient algorithm for rsa text encryption using cuda c
Harvard poster
Nvidia cuda tutorial_no_nda_apr08
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
Hybrid Model Based Testing Tool Architecture for Exascale Computing System
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
GPU-Accelerated Parallel Computing
cuda.ppt
Cuda Without a Phd - A practical guick start
Lecture2 cuda spring 2010
Monte Carlo on GPUs
Introduction to parallel computing using CUDA
Cuda introduction
GPU: Understanding CUDA
Feasibility of Security in Micro-Controllers
Introduction to CUDA
3. CUDA_PPT.ppt info abt threads in cuda
Brief Overview of a Parallel Nbody Code
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
composite construction of structures.pdf
PDF
PPT on Performance Review to get promotions
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
UNIT 4 Total Quality Management .pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
Geodesy 1.pptx...............................................
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
CH1 Production IntroductoryConcepts.pptx
Model Code of Practice - Construction Work - 21102022 .pdf
bas. eng. economics group 4 presentation 1.pptx
CYBER-CRIMES AND SECURITY A guide to understanding
composite construction of structures.pdf
PPT on Performance Review to get promotions
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
UNIT 4 Total Quality Management .pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
OOP with Java - Java Introduction (Basics)
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Lecture Notes Electrical Wiring System Components
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Geodesy 1.pptx...............................................
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
CH1 Production IntroductoryConcepts.pptx

IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6304 Performance Analysis of RSA Algorithm with CUDA Parallel Computing Rishikesh Kadam1, Vishakha Vidhani2 1,2 Student, Department Of Computer Engineering, V.E.S.I.T., Mumbai University, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - In today’s data driven and digitally connected world, computer system security is of paramount importance. The computer security relies on cryptographic security techniques for protection of data. RSA is one of the most used cryptographic algorithms which can be sped up using parallel computing techniques. The advances in computing capability of Graphics Processing Unit as a co- processor of the system CPU allows parallelization of RSA encryption and decryption. This paper focuses on parallel implementation of RSA using Nvidia’s Compute Unified Device Architecture (CUDA). First, the RSA cryptographic algorithm is studied. Then, it is designed for implementation in CUDA framework. In the last step, the RSA function is called parallely by all nodes in a Hadoop cluster. For this step, JCUDA is used. The results are analyzed mainly for the speed up introduced by GPU as compared to the CPU-only implementation. Key Words: Cryptography, RSA, CUDA, GPU, Hadoop. 1. INTRODUCTION RSA algorithm is named after its inventors Ron Rivest, Adi Shamir and Leonard Adleman. It is a public key cryptographic algorithm whose security relies on the difficulty of finding prime factors of large numbers.[1] But the modular power used in this algorithm acts as a bottleneck for its performance which makes its large scale implementations inefficient. Thus, the optimization of RSA has been a research focus. The advancements in multi- core processing and parallel computing is huge boost for RSA optimization. In last decade, the use of Graphical Processing Unit in training of machine learning models and neural networks has been a major trend, but along with that, GPU turns out to be exclusively beneficial for general purpose parallel computing [2] after the advent of Compute Unified Device Architecture (CUDA) technology. Nvidia’s CUDA platform makes thousands of stream processors available to provide parallel speed up. This paper explores the implementation of CUDA for parallelization of RSA algorithm and presents the performance analysis. First, the RSA algorithm is encapsulated in Java Native Interface (JNI). Then, JCUDA is used to implement the parallelization. In section 2 of the paper, traditional principle of RSA algorithm is studied. Section 3 gives an architectural overview of the system hardware used for the implementation. Section 4 presents the design and implementation of RSA parallelization and section 5 shows the results of performance analysis. 2. RSA ALGORITHM RSA algorithm is the most commonly used asymmetric cryptography algorithm. It works on two different keys i.e. public key and private key. The public key is known to everyone and is used for encrypting messages. Encrypted Messages can only be decrypted using the private key. The keys for the RSA algorithm are generated in the following way: 1. Choose two distinct large prime numbers, p and q. 2. Calculate n = pq, n is used as the modulus for both the public and private keys. 3. Compute φ(pq) = (p − 1)(q − 1). (φ is totient function). 4. Pick an integer e such that 1 < e < φ(pq), and e and φ(pq) are coprime (e and φ(pq) share no divisors other than 1). The public key consists of e (often called public exponent) and the modulus n. 5. Find d which satisfies the relation d*e = 1(mod φ(pq)). The private key consists of d (private exponent) and the modulus n. The message M is encrypted using formula C ≡ Me mod n, where C is the encrypted message. And it is decrypted using the formula M ≡ Cd mod n. Example : 1. The plaintext M to be encrypted: 1314 2. Let p=397 and q=401 3. n= pq = 159,197 4. φ(pq) = 396 * 400 = 158,400 5. The public key(e,n) is (343, 159197) and private key d is 12,007 (d*e = 1 mod φ(pq)) 6. C ≡ Me mod n, C ≡ 1314343 (mod 159,197) = 33,677 7. M ≡ Cd mod n, M ≡ 33,67712,007 (mod 159,197) = 1314 3. CUDA ARCHITECTURE NVIDIA has designed a special C-based language CUDA to utilize the massively parallel nature of GPU.[3] Parallel execution is expressed by the kernel function that is executed on a set of threads in parallel on GPU; GPU is also
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6305 called the device. This kernel code is a C code for only one thread. The number of thread blocks, and the number of threads within those blocks that execute this kernel in parallel are given explicitly when this function is called. NVIDIA’s CUDA architecture includes a number of multiprocessors, each multiprocessor consists of: ● 8 scalar processor cores ● 2 special function units for transcendentals ● 1 multithreaded instruction unit ● On-chip shared memory Fig -1: CUDA Hardware Each multiprocessor has SIMD architecture (single instruction multiple data). Each processor of the multiprocessor executes a different thread but all the threads run the same instruction but operate on different data. Several threads (up to 512) may execute concurrently within a multiprocessor and communicate through a small shared memory bank (16KB) which can be seen in Fig -1. Shared memory is local to each multiprocessor unlike device memory and allows local synchronization to be more efficient. It is divided into many parts. Each thread block within multiprocessor accesses its own part of shared memory. This part of shared memory is not accessible by any other thread block of this multiprocessor or of some other multiprocessor. 4. PARALLELIZATION This paper presents a two-level parallelization of RSA algorithm, node level and thread level as shown in Fig -2. Fig -2: Two Level Parallelization CUDA framework is used for parallel implementation on thread level. RSA algorithm makes packets with the same length by dividing the plaintext or ciphertext. The same function will be executed for all the packets. As mentioned in Section 2, the text is considered as integers. All the packets are considered as arrays with same number of integer elements. The encryption and decryption processes are carried out on these packets. We can get the thread and block index and then, they are used to assign packets (elements) to each thread. Thus, on thread level, CUDA multi-thread programming improves the performance of RSA algorithm. [4] The large size of plaintext or ciphertext can be troublesome to handle on a single computer. We used a distributed file system to store the text to be processed. Each node (computer) in the cluster executes the parallel RSA algorithm function. This is node level or computer level parallelization. In such clusters, each node gets a small part of data. In this case, a part of text is processed by each node. Then, each node performs thread level multi-processing. This paper uses Hadoop Distributed File System (HDFS) [5] for distribution of the text to all the nodes in the cluster. HDFS is block-structured file system where each file is divided into multiple blocks. The blocks are stored
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6306 across the cluster of nodes. It is a Java based software framework. Google’s MapReduce [6] and Google File System (GFS) [7] papers built the foundation for HDFS. As Hadoop is based on pure Java, we need to find a method to implement our C language code of RSA on Hadoop. JCUDA [7] provides an interface for invoking CUDA kernels using Java code or framework such as Hadoop. The syntax and usage of JCUDA are beyond the scope of this paper. A sample of kernel code is as follows: __global__ static void RSACUDA(int* arr) { long p,q,e,d,m,n,t,c,i; int x = blockDimension.x * blockId.x + threadId.x; p=191; q=223; n=p*q; t=(p-1)*(q-1); e=t-1; d=1; while(((e*d)%t)!=1) d++; for(i=0;i<50;i++) { c = encrypt(*(arr+i+x*50),e,n); m = decrypt(c,d,n); *(arr+i+x*50) = m; } } In next section, performance of parallel implementation is discussed. 5. VERIFICATION For verification, we divide the tests into two parts. First, we run the RSA program using a CPU-only processing. Then, in the next step, we implement the RSA algorithm on CUDA framework. 5.1 TEST ENVIRONMENT The computers used for the testing have following configuration: ● CPU: Intel Core i5-5200U 2.20GHz ● GPU: Nvidia GeForce 920M (2048MB Memory) ● Memory: 2.20GHz DDR3 (8192MB) For development and testing, Visual Studio 2017 is used. For GPU processing, CUDA Toolkit 10.1 is used. Using the Hadoop framework, the text was distributed on a cluster of 3 nodes. JCUDA was involved to invoke CUDA kernels from Java code on each HDFS node. 5.2 RESULTS Table -1: Time taken for execution (in milliseconds) for different text sizes. No. of chars CPU 10 Threads 100 Threads 500 Threads 10000 10176.45 12023.84 1256.73 276.49 20000 20312.82 24096.33 2303.34 530.18 30000 31034.47 36201.93 3678.01 760.48 40000 40583.05 47936.17 4907.38 1015.74 50000 51396.11 61034.28 6179.57 1346.05 60000 61564.26 71124.05 7389.18 1592.38 70000 71891.49 83456.61 8599.06 1904.75 80000 82392.73 96345.34 9803.49 2267.11 Table -1 displays the relationship between the size of input text and the time taken for execution (in milliseconds) in traditional CPU-only mode and GPU implemented multithreading mode. Chart -1: Time taken for execution by CPU-only and GPU multithreading mode. Chart 1 represents this output graphically. The relationship between size of input text and execution time is linear. As we can see, the execution time is very less when we implement RSA algorithm in 500-thread mode. It is almost 40 to 50 times faster than the CPU-only mode. This output proves that higher the degree of parallelism, better is the performance.
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6307 Chart -2: Elements processed per millisecond in CPU-only and multithreading mode If number of threads are less, about 10, RSA algorithm takes longer takes longer than CPU-only mode. This is the result of co-operation processing and communication overhead between CPU and GPU. It majorly consists of copying of data from the host computer (CPU) to the device (GPU). This can be verified in Chart -2. 6. CONCLUSIONS This paper presented the CUDA based implementation of RSA algorithm. We described the traditional RSA cryptosystem. The large size of text and modular power function act as the bottleneck for performance which is a drawback for large scale implementation. We used Nvidia’s Compute Unified Device Architecture (CUDA) for parallel implementation of RSA. First, RSA was executed in CPU-only mode. Then, it was implemented in CUDA framework in 10-thread, 100-thread and 500- thread modes. The results were compared which made it clear that higher the degree parallelism, better us the performance. The 10-thread mode gave a poor performance as compared to CPU-only mode. It was a result of communication and data transfer overhead between CPU and GPU. Apache Hadoop framework can be used for distributed storage of large text files. All the nodes in the cluster store a part of the text file and execute RSA in parallel on the data. REFERENCES [1] R.L. Rivest, A. Shamir, L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems,” Communications of the ACM, vol. 21, iss. 2, pp. 120-126, February 1978. [2] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, Tim Purcell, “A survey of general-purpose computation on graphics hardware,” Computer Graphics Forum, vol. 26, pp. 80-113, March 2007. [3] NVIDIA Corporation, NVIDIA Compute Unified Device Architecture Programming Guide, Version 10.1, February 2019. [4] Daniel Page, Nigel P. Smart, “Parallel cryptographic arithmetic using a redundant montgomery representation,” IEEE Transactions on Computers, vol. 53, no. 11, pp. 1474-1482, November 2004. [5] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System,” Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1-10, May 2010. [6] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Sixth Symposium on Operating System Design and Implementation, San Francisco, pp. 137-150, December 2004. [7] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, “The Google File System,” Proceedings of the 19th ACM Symposium on Operating Systems Principles, NY, pp. 20-43, October 2003. [8] Yonghong Yan, Max Grossman, Vivek Sarkar, “JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA,” Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 887–899, August 2009.