SlideShare a Scribd company logo
IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 13, No. 2, June 2024, pp. 2165∼2172
ISSN: 2252-8938, DOI: 10.11591/ijai.v13.i2.pp2165-2172 ❒ 2165
Generative adversarial network-based phishing URL
detection with variational autoencoder and transformer
Jishnu Kaitholikkal Sasi, Arthi Balakrishnan
Department of Computing Technologies, College of Engineering and Technology, SRM Institute of Science and Technology,
Kattankulathur Campus, Chengalpattu, TN, India
Article Info
Article history:
Received Jun 9, 2023
Revised Oct 31, 2023
Accepted Dec 2, 2023
Keywords:
Cyber crimes
Generative adversarial networks
Phishing URLs
Transformers
Variational autoencoders
ABSTRACT
Phishing attacks pose a constant threat to online security, necessitating the de-
velopment of efficient tools for identifying malicious URLs. In this article, we
propose a novel approach to detect phishing URLs employing a generative ad-
versarial network (GAN) with a variational autoencoder (VAE) as the genera-
tor and a transformer model with self-attention as the discriminator. The VAE
generator is trained to produce synthetic URLs. In contrast, the transformer dis-
criminator uses its self-attention mechanism to focus on the different parts of the
input URLs to extract crucial features. Our model uses adversarial training to
distinguish between legitimate and phishing URLs. We evaluate the effective-
ness of the proposed method using a large set of one million URLs that incor-
porate both authentic and phishing URLs. Experimental results show that our
model is effective, with an impressive accuracy of 97.75%, outperforming the
baseline models. This study significantly improves online security by offering a
novel and highly accurate phishing URL detection method.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Arthi Balakrishnan
Department of Computing Technologies, College of Engineering and Technology
SRM Institute of Science and Technology
Kattankulathur Campus, SRM Nagar, Chengalpattu, Chennai 603203, Tamilnadu, India
Email: arthib@srmist.edu.in
1. INTRODUCTION
In this modern digital environment, cyber crimes are more widespread than ever. Among this, phishing
attacks are scandalous because of their wide range of attacking levels and the anonymity of attackers [1].
This poses a serious threat to individuals, organisations, and their sensitive data. In order to trick users into
disclosing private information, such as passwords, credit card numbers, or personal information, phishing
attempts frequently use fraudulent emails, messages, or websites that act as reliable sources [2]. A record-
breaking 1,270,883 phishing attacks were reported during the third quarter, according to the anti-phishing
working group (APWG) report from 2022. This is the highest quarterly figure ever recorded, highlighting
the need for quick and efficient solutions to stop this growing threat. With 23.2% of all reported incidents,
the financial sector in particular has been a top target for phishing attacks. Business email compromise (BEC)
attacks, a sophisticated form of phishing, have continued, and during the third quarter, there was a 59% increase
in wire transfer BEC attacks. The report also reveals a startling 1,000% increase in advance fee fraud scams
that are sent via email, highlighting the constantly evolving methods used by cybercriminals. It is essential to
recognise and block these phishing URLs in order to ensure online security and prevent users from becoming
victims of cybercriminals [3].
Journal homepage: http://ijai.iaescore.com
2166 ❒ ISSN: 2252-8938
Modern methods for identifying phishing URLs frequently rely on manual analysis, heuristics, or
rule-based algorithms, which find it difficult to keep up with the attackers’ constantly changing strategies
[4]. Therefore, it is imperative to create reliable and automated techniques for efficiently detecting phishing
URLs. Machine learning (ML) and deep learning techniques have recently demonstrated excellent results in a
number of fields, and their use in phishing URL identification has enormous promise [5]. Since ML can learn
automatically from the training dataset, numerous researchers have been looking at it for phishing detection.
Features are taken from the URL in this. To achieve the highest level of accuracy, many of them created
various feature extraction strategies for ML algorithms. With the help of ML algorithms like decision tree,
support vector machine, random forest, and K-nearest neighbor, some researchers are able to detect phishing
URLs with an accuracy of more than 90% [6]-[11]. The manual feature engineering of these models is their
primary flaw. It implies that because the data were extracted based on manual interpretations, it’s possible
that crucial features that machines might have picked up on were missed. After that point, more researchers
proposed deep learning-based techniques due to the automatic feature extraction characteristic of deep learning.
Phishing detection systems based on convolutional neural network (CNN) were given in methods [12]-[16],
while some researchers [17]-[21] utilized recurrent neural network (RNN), multilayer perceptron (MLP), long
short-term memory (LSTM), and its hybrid models. The majority of the approaches were more than 95%
accurate. However, the system’s biggest flaw was its inability to recognize dynamic phishing URLs and use
imbalanced data sets.
In this work, we propose a novel method for phishing URL detection using a generative adversarial
network (GAN) that comprises a variational autoencoder (VAE) as the generator and a transformer model with
self-attention as the discriminator. While VAEs can capture latent representations and generate realistic URL
samples, generator parts have demonstrated remarkable success in producing synthetic URLs. The transformer
model uses self-attention mechanisms to distinguish between legitimate and phishing URLs. The primary
goal of this research is to implement a reliable and accurate phishing URL detection system that can adapt
to changing phishing tactics and offer a strong defence against them. We aim to increase the accuracy and
effectiveness of phishing URL detection, enhancing online security for people and organisations, by using the
capabilities of GAN with VAE and transformer.
2. METHOD
The proposed methodology uses a GAN architecture with VAE as the generator and a transformer
model with self-attention as the discriminator. The system aims to accurately detect phishing URLs and en-
hance online security against attacks [18]. Figure 1 shows the workflow of the proposed model and the follow-
ing paragraphs outline the key components and steps of the proposed method.
Figure 1. Workflow of the proposed model
Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
Int J Artif Intell ISSN: 2252-8938 ❒ 2167
2.1. Dataset preparation
Dataset preparation focuses on collecting and preparing the dataset. It comprises one million URLs,
including both phishing URLs from different sources as shown in Table 1 [22]. The CSV file consists of a
URL column and a label column with corresponding labels (0 for legitimate URLs and 1 for phishing URLs).
The tokenizer from Keras is used to tokenize URLs at the character level. Sequences of tokens are generated
using texts to sequences. pad sequences are used to ensure that all urls are of the same length. In order
to clean, normalise, remove duplicates, and ensure a balanced distribution between phishing and legitimate
URLs, various data preprocessing methods are used. The data set is split into 80% training and 20% testing
sets using train test split [23].
Table 1. Dataset source details
URL data set Source
Legitimate URLs Majestic Million, Common Crawl, Kaggle, GitHub, and Alexa
Phishing URLs PhishTank, Kaggle, Common Crawl, and OpenPhish
2.2. Variational autoencoder as generator
With the aim of developing a latent representation of URLs and producing synthetic URLs that closely
resemble genuine ones, the VAE is trained on the prepared dataset. By limiting the reconstruction loss, the
generator’s performance is optimised, resulting in the generation of realistic URLs. In order to learn a low-
dimensional representation (latent space) of the input URL sequences, the VAE model is built [24]. The maxi-
mum sequence length of the URLs in the dataset determines the input shape of the VAE. This guarantees that
each URL sequence will be processed with the same length. The encoder portion of the VAE is added as a thick
layer. The encoder layer reduces the input URL sequences’ dimensionality to the designated latent dimension
[25]. This is accomplished by projecting the input data into a lower-dimensional space via a non-linear trans-
formation (activation function). The decoder portion of the VAE is added as a further dense layer. The latent
representation created by the encoder is used by the decoder layer to reassemble the URL sequences. It returns
the data’s original dimensionality [26].
The Adam optimizer, a well-liked option for training neural networks, is used to create the VAE
model. The VAE’s aim is the binary cross-entropy loss function. It calculates the discrepancy between actual
input URL sequences and predicted URL sequences. The VAE model is trained on the training URLs using
a predetermined batch size and number of epochs. The model gains the ability to encode URLs into a lower-
dimensional representation and decode them again to recreate the original sequences during training. The
model is prompted to produce precise reconstructions of the input URLs by minimising the binary cross-
entropy loss during training. In this method, the VAE model is trained to recognise the key traits and patterns
in the input URL sequences. The remaining steps of the phishing URL detection system can make use of the
efficient encoding and reconstruction of the URLs made possible by the reduced-dimensional latent space [27].
2.3. Transformer model with self-attention as discriminator
As the discriminator element of the GAN, the transformer model with self-attention is introduced. The
transformer discriminator is pre-trained on a sizable corpus of text data, enabling it to extract URL context. To
enable the discriminator to accurately identify between legitimate and phishing URLs, fine-tuning is carried
out on the training set of URLs using a binary classification goal [28]. The discriminator model is constructed
to classify URLs. The discriminator model consists of several layers. Figure 2 represents the structure of the
discriminator.
Figure 2. Structure of discriminator
Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
2168 ❒ ISSN: 2252-8938
− Embedding layer: the input URLs are converted to detailed vector representations by this layer. It helps
in capturing the semantic intent behind the URLs.
− Multi head attention layer: the self-attention technique is used by this layer to extract key details from the
embedded URL sequences. It enables the model to concentrate on different aspects of the input during
processing. Figure 3 shows the multi head attention mechanism [29]. In actuality, we simultaneously
compute the attention function on a collection of queries that are gathered into a matrix Q. In matrices K
and V, the keys and values are also condensed together. Multiple simultaneous attention layers combine
to create multi-head attention. The output matrix is calculated as follows:
MultiHead(Q, K, V ) = Concat(h1, h2, ..., h8)WO
(1)
hi = Attention(QWQ
i , KWK
i , V WV
i ) (2)
where the projections are parameter matrices:
WQ
i ∈ IRdmodel×dk
(3)
WK
i ∈ IRdmodel×dk
(4)
WV
i ∈ IRdmodel×dv
(5)
WO
∈ IRhdv×dmodel
(6)
In our system, h=8 attention layers are there and they are arranged parallel. The value of dk = dv = dmodel/h = 64.
− Normalisation layer: the outputs from the attention layer are normalised in this layer, guaranteeing stable
training and enhanced performance.
− Global average pooling 1D layer: by calculating the average value over the time dimension, this layer
can produce output with smaller spatial dimensions and URLs with defined lengths.
− Dense layer: for binary classification, a final dense layer with sigmoid activation is added. It generates a
probability score that indicates whether a URL is likely to be real or phishing.
The Adam optimizer, an effective method for neural network training, is used to create the discrimina-
tor model. As the loss function, binary cross-entropy loss is employed. It calculates the difference between the
actual labels and the expected probabilities. The model’s training performance in classification is also measured
using the accuracy metric. On the training URLs and their respective labels for the number of epochs, with a
256 batch size, the discriminator is trained. Based on the retrieved attributes, the model develops the ability
to distinguish between legal and phishing URLs during training. The Adam optimizer adjusts the weights of
the model to enhance classification performance while minimising the binary cross-entropy loss. The accuracy
statistic aids in tracking the model’s development throughout training. This method of training the discrimina-
tor teaches it to recognise patterns and characteristics that distinguish between legitimate and phishing URLs.
In order to focus on crucial information in the URL sequences, the model uses the self-attention mechanism.
The model then generates predictions based on the learned representations.
Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
Int J Artif Intell ISSN: 2252-8938 ❒ 2169
Figure 3. Multiple simultaneous attention layers combine to create multi-head attention
2.4. Adversarial training
The generator and discriminator are used to train the GAN using adversarial training procedures. To
enhance their individual skills, the generator and discriminator are iteratively trained in a competitive manner.
While the discriminator learns to distinguish between actual and fake URLs, the generator creates fake phishing
URLs to trick it [30]. In this method, an adversarial training strategy is used to train a combined model that
combines the VAE and discriminator models.
− Freezing the discriminator weights: the discriminator’s trainable parameter is set to false prior to training
the combined model. By doing this, the discriminator weights are guaranteed to remain constant during
the adversarial training procedure. The VAE may concentrate on enhancing its reconstruction capability
without being impacted by the discriminator’s feedback because the discriminator’s weights are frozen
and are not modified.
− Model stacking: the combined model is built by combining the VAE and discriminator models. The dis-
criminator for categorization uses the output of the VAE as its input. This stacking enables the combined
model to carry out both the discriminator’s classification task and the VAE’s reconstruction mission.
− Compilation: the Adam optimizer, which is a well-liked optimizer for training neural networks, is used to
create the merged model. To measure the difference between the genuine labels and the expected output
of the discriminator, binary cross-entropy loss is utilised as the loss function.
− Training: the training URLs and labels for the training URLs are used to train the combined model
over a predetermined number of epochs and with a predetermined batch size. The integrated model is
simultaneously tuned during training to correctly classify the URLs and rebuild them accurately (VAE
objective). The system’s overall performance is enhanced by the model by taking use of the antagonistic
interactions between the discriminator and VAE.
The combined model is trained in such a way that the VAE learns to produce convincing URL recon-
structions that can fool the discriminator. At the same time, the discriminator gains proficiency in accurately
distinguishing between trustworthy and phishing URLs. This adversarial training procedure enhances the sys-
tem’s capacity to identify and distinguish between malicious and trustworthy URLs.
Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
2170 ❒ ISSN: 2252-8938
3. RESULTS AND DISCUSSION
A dataset of one million URLs served as the basis for testing our suggested technique for detecting
phishing URLs. The major objective was to improve online safety against phishing attacks by accurately dif-
ferentiating between authentic and fraudulent URLs. The trial results show the efficiency of the suggested
strategy, detecting phishing URLs with an astonishing accuracy of 97.75%. To assess the superiority of our
model, we compared it with two other crucial deep learning-powered phishing detection models, as demon-
strated in Table 2. A graphical representation of this is provided in Figure 4, with our model exhibiting clear
dominance in every performance metric when compared to the other two.
Table 2. Comparison with baseline models
Models Used algorithm Dataset size Accuracy (%) Precision (%) Recall (%) F1-score (%)
Aljofey et al.
[31]
Character level
Convolution neural network
5,58,962 95.02 92.35 97.09 95.13
Ariyadasa et al.
[32]
Long-term recurrent
Convolutional network &
Graph convolutional network
1,36,096 96.42 96.40 96.44 96.42
Proposed system GAN using VAE
and transformers
1 Million 97.75 97.91 97.62 97.77
Figure 4. Comparison with baseline models
4. CONCLUSION
Using a transformer model with self-attention as the discriminator and a GAN with a VAE as the gen-
erator, we developed a unique method for phishing URL detection in this paper. Through intensive testing on
a sizable sample of one million URLs, we attained a phenomenal accuracy of 97.75%. This illustrates how our
suggested methodology might enhance internet security by accurately identifying phishing URLs. Our research
demonstrates the potential of GANs for phishing URL detection. By exploiting the VAE’s generation capabil-
ities, we were able to create believable synthetic URLs that evaluated the discriminator’s capacity to generate
precise classifications. By using the transformer model with self-attention, we were able to extract important
elements from the text and recognise the characteristic features of phishing URLs. The adversarial training
increased the model’s ability to discriminate, leading to better results than baseline models. Considering the
great outcomes of our suggested approach, there are still many of areas that can be researched and improved.
First, testing out different generator and discriminator designs might help the model perform better. We also
want to develop a deep learning-based browser plugin for real-time phishing URL identification.
REFERENCES
[1] H. Abroshan, J. Devos, G. Poels, and E. Laermans, “COVID-19 and Phishing: Effects of Human Emotions, Behavior, and De-
mographics on the Success of Phishing Attempts During the Pandemic,” IEEE Access, vol. 9, pp. 121916–121929, 2021, doi:
10.1109/ACCESS.2021.3109091.
[2] Z. Zhang, H. Al Hamadi, E. Damiani, C. Y. Yeun, and F. Taher, “Explainable Artificial Intelligence Applications in Cyber Security:
State-of-the-Art in Research,” IEEE Access, vol. 10, pp. 93104–93139, 2022, doi: 10.1109/ACCESS.2022.3204051.
Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
Int J Artif Intell ISSN: 2252-8938 ❒ 2171
[3] P. López-Aguilar, C. Patsakis, and A. Solanas, “The Role of Extraversion in Phishing Victimisation: A System-
atic Literature Review,” in 2022 APWG Symposium on Electronic Crime Research (eCrime), 2022, pp. 1–10, doi:
10.1109/eCrime57793.2022.10142078.
[4] N. Q. Do, A. Selamat, O. Krejcar, E. Herrera-Viedma, and H. Fujita, “Deep Learning for Phishing Detection: Taxonomy, Current
Challenges and Future Directions,” IEEE Access, vol. 10, pp. 36429–36463, 2022, doi: 10.1109/ACCESS.2022.3151903.
[5] R. Zieni, L. Massari, and M. C. Calzarossa, “Phishing or Not Phishing? A Survey on the Detection of Phishing Websites,” IEEE
Access, vol. 11, pp. 18499–18519, 2023, doi: 10.1109/ACCESS.2023.3247135.
[6] A. Basit, M. Zafar, X. Liu, A. R. Javed, Z. Jalil, and K. Kifayat, “A comprehensive survey of AI-enabled phishing attacks detection
techniques,” Telecommunication Systems, vol. 76, no. 1, pp. 139–154, 2021, doi: 10.1007/s11235-020-00733-2.
[7] S. Anupam and A. K. Kar, “Phishing website detection using support vector machines and nature-inspired optimization algorithms,”
Telecommunication Systems, vol. 76, no. 1, pp. 17–32, 2021, doi: 10.1007/s11235-020-00739-w.
[8] H. Shirazi, S. R. Muramudalige, I. Ray, and A. P. Jayasumana, “Improved Phishing Detection Algorithms using Adversarial
Autoencoder Synthesized Data,” in 2020 IEEE 45th Conference on Local Computer Networks (LCN), 2020, pp. 24–32, doi:
10.1109/LCN48667.2020.9314775.
[9] M. S. M. Prince, A. Hasan, and F. M. Shah, “A New Ensemble Model for Phishing Detection Based on Hybrid Cumulative Feature
Selection,” in 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), 2021, pp. 7–12, doi:
10.1109/ISCAIE51753.2021.9431782.
[10] A. Butnaru, A. Mylonas, and N. Pitropakis, “Towards lightweight url-based phishing detection,” Future Internet, vol. 13, no. 6, pp.
1–15, 2021, doi: 10.3390/fi13060154.
[11] Ö. Kasim, “Automatic detection of phishing pages with event-based request processing, deep-hybrid feature extraction and light
gradient boosted machine model,” Telecommunication Systems, vol. 78, no. 1, pp. 103–115, 2021, doi: 10.1007/s11235-021-
00799-6.
[12] M. Korkmaz, E. Kocyigit, O. K. Sahingoz, and B. Diri, “Phishing Web Page Detection Using N-gram Features Extracted From
URLs,” in 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA),
2021, pp. 1–6, doi: 10.1109/HORA52670.2021.9461378.
[13] C. Opara, B. Wei, and Y. Chen, “HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Tech-
niques on HTML Analysis,” in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8, doi:
10.1109/IJCNN48605.2020.9207707.
[14] W. Wei, Q. Ke, J. Nowak, M. Korytkowski, R. Scherer, and M. Woźniak, “Accurate and fast URL phishing detector: A convolutional
neural network approach,” Computer Networks, vol. 178, pp. 1–13, 2020, doi: 10.1016/j.comnet.2020.107275.
[15] S. Abdelnabi, K. Krombholz, and M. Fritz, “VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity,” in
Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1681–1698, doi:
10.1145/3372297.3417233.
[16] S. Singh, M. P. Singh, and R. Pandey, “Phishing Detection from URLs Using Deep Learning Approach,” in 2020 5th International
Conference on Computing, Communication and Security (ICCCS), 2020, pp. 1–4, doi: 10.1109/ICCCS49678.2020.9277459.
[17] P. Yang, G. Zhao, and P. Zeng, “Phishing website detection based on multidimensional features driven by deep learning,” IEEE
Access, vol. 7, pp. 15196–15209, 2019, doi: 10.1109/ACCESS.2019.2892066.
[18] A. AlEroud and G. Karabatis, “Bypassing Detection of URL-based Phishing Attacks Using Generative Adversarial Deep Neu-
ral Networks,” in Proceedings of the Sixth International Workshop on Security and Privacy Analytics, 2020, pp. 53–60, doi:
10.1145/3375708.3380315.
[19] T. Feng and C. Yue, “Visualizing and Interpreting RNN Models in URL-based Phishing Detection,” in Proceedings of the 25th ACM
Symposium on Access Control Models and Technologies, 2020, pp. 13–24, doi: 10.1145/3381991.3395602.
[20] I. Saha, D. Sarma, R. J. Chakma, M. N. Alam, A. Sultana, and S. Hossain, “Phishing Attacks Detection using Deep Learning
Approach,” in 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1180–1185,
doi: 10.1109/ICSSIT48917.2020.9214132.
[21] S. J. Bu and S. B. Cho, “Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing url
detection,” Electronics, vol. 10, no. 12, pp. 1–16, 2021, doi: 10.3390/electronics10121492.
[22] A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University -
Computer and Information Sciences, vol. 35, no. 2, pp. 590–611, 2023, doi: 10.1016/j.jksuci.2023.01.004.
[23] B. B. Gupta, K. Yadav, I. Razzak, K. Psannis, A. Castiglione, and X. Chang, “A novel approach for phishing URLs detection
using lexical based machine learning in a real-time environment,” Computer Communications, vol. 175, pp. 47–57, 2021, doi:
10.1016/j.comcom.2021.04.023.
[24] M. K. Prabakaran, P. M. Sundaram, and A. D. Chandrasekar, “An enhanced deep learning-based phishing detection mechanism to
effectively identify malicious URLs using variational autoencoders,” IET Information Security, vol. 17, no. 3, pp. 423–440, 2023,
doi: 10.1049/ise2.12106.
[25] L. Bergamin, T. Carraro, M. Polato, and F. Aiolli, “Novel Applications for VAE-based Anomaly Detection Systems,” in 2022 Inter-
national Joint Conference on Neural Networks (IJCNN), 2022, vol. 2022-July, pp. 1–8, doi: 10.1109/IJCNN55064.2022.9892879.
[26] G. Qi and H. Yu, “CMVAE: Causal Meta VAE for Unsupervised Meta-Learning,” in The Thirty-Seventh AAAI Conference on
Artificial Intelligenc, pp. 9480–9488, 2023, doi: 10.1609/aaai.v37i8.26135.
[27] L. Zhang, P. Zhang, L. Liu, and J. Tan, “Multiphish: Multi-Modal Features Fusion Networks for Phishing Detection,” in ICASSP
2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3520–3524, doi:
10.1109/ICASSP39728.2021.9415016.
[28] P. Maneriker, J. W. Stokes, E. G. Lazo, D. Carutasu, F. Tajaddodianfar, and A. Gururajan, “URLTran: Improving Phishing URL
Detection Using Transformers,” in MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM), 2021, pp.
197–204, doi: 10.1109/MILCOM52596.2021.9653028.
[29] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5999–6009.
[30] S. Al-Ahmadi, A. Alotaibi, and O. Alsaleh, “PDGAN: Phishing Detection With Generative Adversarial Networks,” IEEE Access,
vol. 10, pp. 42459–42468, 2022, doi: 10.1109/ACCESS.2022.3168235.
Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
2172 ❒ ISSN: 2252-8938
[31] A. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J. P. Niyigena, “An effective phishing detection model based on character level convo-
lutional neural network from URL,” Electronics, vol. 9, no. 9, pp. 1–24, 2020, doi: 10.3390/electronics9091514.
[32] S. Ariyadasa, S. Fernando, and S. Fernando, “Combining Long-Term Recurrent Convolutional and Graph Convolutional Net-
works to Detect Phishing Sites Using URL and HTML,” IEEE Access, vol. 10, pp. 82355–82375, 2022, doi: 10.1109/AC-
CESS.2022.3196018.
BIOGRAPHIES OF AUTHORS
Jishnu Kaitholikkal Sasi is pursuing a Ph.D. in Computer Science from the SRM Institute
of Science and Technology. He completed his B.Tech. in Computer Science from KMCT College
of Engineering in 2017 and received an M.Tech. degree in Computer Science from Government En-
gineering College, Wayanad, in 2021. He has good programming skills and is proficient in software
engineering, deep learning, and blockchain. His research mainly focuses on detecting phishing URLs
using deep learning. He can be contacted at email: js2963@srmist.edu.in.
Arthi Balakrishnan holds a Ph.D. degree in the field of Computer Science and Engineer-
ing from Anna University. She has 15+ years of experience in teaching. Her areas of interest include
artificial intelligence, machine learning, software engineering, and Internet of Things. She has pub-
lished several articles in various reputed national and international, such as Scopus and SCI journals.
She has also authored book chapters in CRC and Springer Publications. She has presented papers in
various national and international conferences and attended many workshops, seminars, and faculty
development programs to be in track with the changing technology and teaching methodology. She
is a member of various scientific and professional bodies. She has been awarded the IET Inspiring
Young Teacher Award for the year 2016-2017 for the IET Chennai. She can be contacted at email:
arthib@srmist.edu.in.
Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172

More Related Content

PDF
A multi-algorithm approach for phishing uniform resource locator’s detection
PDF
Phishing Detection using Decision Tree Model
PDF
IRJET - Chrome Extension for Detecting Phishing Websites
PDF
Phishing Website Detection Paradigm using XGBoost
PDF
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
PDF
DETECTION OF PHISHING WEBSITES USING MACHINE LEARNING
PDF
Phishing Website Detection using Classification Algorithms
PDF
Detection of Phishing Websites
A multi-algorithm approach for phishing uniform resource locator’s detection
Phishing Detection using Decision Tree Model
IRJET - Chrome Extension for Detecting Phishing Websites
Phishing Website Detection Paradigm using XGBoost
A Hybrid Approach For Phishing Website Detection Using Machine Learning.
DETECTION OF PHISHING WEBSITES USING MACHINE LEARNING
Phishing Website Detection using Classification Algorithms
Detection of Phishing Websites

Similar to Generative adversarial network-based phishing URL detection with variational autoencoder and transformer (20)

PDF
PDMLP: PHISHING DETECTION USING MULTILAYER PERCEPTRON
PPTX
PHISHING URL DETECTION AND MALICIOUS LINK
PDF
[IJET V2I5P15] Authors: V.Preethi, G.Velmayil
PDF
IRJET- Phishing Website Detection based on Machine Learning
PDF
IRJET - Phishing Attack Detection and Prevention using Linkguard Algorithm
PDF
IRJET - An Automated System for Detection of Social Engineering Phishing Atta...
PDF
IRJET- Detecting Phishing Websites using Machine Learning
PDF
Phishing Website Detection Using Machine Learning
PDF
Detection of Phishing Websites using machine Learning Algorithm
PPTX
193002138_Fake-Website-Detection-Fina-updatet-ppt.pptx
PDF
USING BLACK-LIST AND WHITE-LIST TECHNIQUE TO DETECT MALICIOUS URLS
PDF
IRJET- Machine Learning Techniques to Seek Out Malicious Websites
PDF
Detecting Phishing using Machine Learning
PDF
Phishing Website Detection Using Machine Learning
PDF
Detecting Phishing Websites Using Machine Learning
PDF
IRJET- Noisy Content Detection on Web Data using Machine Learning
PDF
IRJET - PHISCAN : Phishing Detector Plugin using Machine Learning
PDF
PHISHING URL DETECTION USING MACHINE LEARNING
PDF
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
PDF
Malicious Link Detection System
PDMLP: PHISHING DETECTION USING MULTILAYER PERCEPTRON
PHISHING URL DETECTION AND MALICIOUS LINK
[IJET V2I5P15] Authors: V.Preethi, G.Velmayil
IRJET- Phishing Website Detection based on Machine Learning
IRJET - Phishing Attack Detection and Prevention using Linkguard Algorithm
IRJET - An Automated System for Detection of Social Engineering Phishing Atta...
IRJET- Detecting Phishing Websites using Machine Learning
Phishing Website Detection Using Machine Learning
Detection of Phishing Websites using machine Learning Algorithm
193002138_Fake-Website-Detection-Fina-updatet-ppt.pptx
USING BLACK-LIST AND WHITE-LIST TECHNIQUE TO DETECT MALICIOUS URLS
IRJET- Machine Learning Techniques to Seek Out Malicious Websites
Detecting Phishing using Machine Learning
Phishing Website Detection Using Machine Learning
Detecting Phishing Websites Using Machine Learning
IRJET- Noisy Content Detection on Web Data using Machine Learning
IRJET - PHISCAN : Phishing Detector Plugin using Machine Learning
PHISHING URL DETECTION USING MACHINE LEARNING
IRJET- Detecting the Phishing Websites using Enhance Secure Algorithm
Malicious Link Detection System
Ad

More from IAESIJAI (20)

PDF
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
PDF
Graph-based methods for transaction databases: a comparative study
PDF
Developing a website for English-speaking practice to English as a foreign la...
PDF
Evaluating ChatGPT’s Mandarin “yue” pronunciation system in language learning
PDF
A contest of sentiment analysis: k-nearest neighbor versus neural network
PDF
A proposed approach for plagiarism detection in Myanmar Unicode text
PDF
Multi-task deep learning for Vietnamese capitalization and punctuation recogn...
PDF
A comparative study of natural language inference in Swahili using monolingua...
PDF
Abstractive summarization using multilingual text-to-text transfer transforme...
PDF
Enhancing emotion recognition model for a student engagement use case through...
PDF
Automatic detection of dress-code surveillance in a university using YOLO alg...
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
Two-dimensional Klein-Gordon and Sine-Gordon numerical solutions based on dee...
PDF
Improved convolutional neural networks for aircraft type classification in re...
PDF
Primary phase Alzheimer's disease detection using ensemble learning model
PDF
Deep learning-based techniques for video enhancement, compression and restora...
PDF
Hybrid model detection and classification of lung cancer
PDF
Adaptive kernel integration in visual geometry group 16 for enhanced classifi...
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
Enhancing fall detection and classification using Jarratt‐butterfly optimizat...
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
Graph-based methods for transaction databases: a comparative study
Developing a website for English-speaking practice to English as a foreign la...
Evaluating ChatGPT’s Mandarin “yue” pronunciation system in language learning
A contest of sentiment analysis: k-nearest neighbor versus neural network
A proposed approach for plagiarism detection in Myanmar Unicode text
Multi-task deep learning for Vietnamese capitalization and punctuation recogn...
A comparative study of natural language inference in Swahili using monolingua...
Abstractive summarization using multilingual text-to-text transfer transforme...
Enhancing emotion recognition model for a student engagement use case through...
Automatic detection of dress-code surveillance in a university using YOLO alg...
Hindi spoken digit analysis for native and non-native speakers
Two-dimensional Klein-Gordon and Sine-Gordon numerical solutions based on dee...
Improved convolutional neural networks for aircraft type classification in re...
Primary phase Alzheimer's disease detection using ensemble learning model
Deep learning-based techniques for video enhancement, compression and restora...
Hybrid model detection and classification of lung cancer
Adaptive kernel integration in visual geometry group 16 for enhanced classifi...
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
Enhancing fall detection and classification using Jarratt‐butterfly optimizat...
Ad

Recently uploaded (20)

PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
A Presentation on Artificial Intelligence
PPTX
OMC Textile Division Presentation 2021.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
Mushroom cultivation and it's methods.pdf
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PPTX
cloud_computing_Infrastucture_as_cloud_p
PPTX
Tartificialntelligence_presentation.pptx
PDF
Getting Started with Data Integration: FME Form 101
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
PPTX
Spectroscopy.pptx food analysis technology
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Unlocking AI with Model Context Protocol (MCP)
A Presentation on Artificial Intelligence
OMC Textile Division Presentation 2021.pptx
Machine learning based COVID-19 study performance prediction
Mushroom cultivation and it's methods.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
cloud_computing_Infrastucture_as_cloud_p
Tartificialntelligence_presentation.pptx
Getting Started with Data Integration: FME Form 101
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Spectral efficient network and resource selection model in 5G networks
Univ-Connecticut-ChatGPT-Presentaion.pdf
NewMind AI Weekly Chronicles - August'25-Week II
Agricultural_Statistics_at_a_Glance_2022_0.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
Spectroscopy.pptx food analysis technology
Advanced methodologies resolving dimensionality complications for autism neur...
A comparative analysis of optical character recognition models for extracting...
Profit Center Accounting in SAP S/4HANA, S4F28 Col11

Generative adversarial network-based phishing URL detection with variational autoencoder and transformer

  • 1. IAES International Journal of Artificial Intelligence (IJ-AI) Vol. 13, No. 2, June 2024, pp. 2165∼2172 ISSN: 2252-8938, DOI: 10.11591/ijai.v13.i2.pp2165-2172 ❒ 2165 Generative adversarial network-based phishing URL detection with variational autoencoder and transformer Jishnu Kaitholikkal Sasi, Arthi Balakrishnan Department of Computing Technologies, College of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur Campus, Chengalpattu, TN, India Article Info Article history: Received Jun 9, 2023 Revised Oct 31, 2023 Accepted Dec 2, 2023 Keywords: Cyber crimes Generative adversarial networks Phishing URLs Transformers Variational autoencoders ABSTRACT Phishing attacks pose a constant threat to online security, necessitating the de- velopment of efficient tools for identifying malicious URLs. In this article, we propose a novel approach to detect phishing URLs employing a generative ad- versarial network (GAN) with a variational autoencoder (VAE) as the genera- tor and a transformer model with self-attention as the discriminator. The VAE generator is trained to produce synthetic URLs. In contrast, the transformer dis- criminator uses its self-attention mechanism to focus on the different parts of the input URLs to extract crucial features. Our model uses adversarial training to distinguish between legitimate and phishing URLs. We evaluate the effective- ness of the proposed method using a large set of one million URLs that incor- porate both authentic and phishing URLs. Experimental results show that our model is effective, with an impressive accuracy of 97.75%, outperforming the baseline models. This study significantly improves online security by offering a novel and highly accurate phishing URL detection method. This is an open access article under the CC BY-SA license. Corresponding Author: Arthi Balakrishnan Department of Computing Technologies, College of Engineering and Technology SRM Institute of Science and Technology Kattankulathur Campus, SRM Nagar, Chengalpattu, Chennai 603203, Tamilnadu, India Email: arthib@srmist.edu.in 1. INTRODUCTION In this modern digital environment, cyber crimes are more widespread than ever. Among this, phishing attacks are scandalous because of their wide range of attacking levels and the anonymity of attackers [1]. This poses a serious threat to individuals, organisations, and their sensitive data. In order to trick users into disclosing private information, such as passwords, credit card numbers, or personal information, phishing attempts frequently use fraudulent emails, messages, or websites that act as reliable sources [2]. A record- breaking 1,270,883 phishing attacks were reported during the third quarter, according to the anti-phishing working group (APWG) report from 2022. This is the highest quarterly figure ever recorded, highlighting the need for quick and efficient solutions to stop this growing threat. With 23.2% of all reported incidents, the financial sector in particular has been a top target for phishing attacks. Business email compromise (BEC) attacks, a sophisticated form of phishing, have continued, and during the third quarter, there was a 59% increase in wire transfer BEC attacks. The report also reveals a startling 1,000% increase in advance fee fraud scams that are sent via email, highlighting the constantly evolving methods used by cybercriminals. It is essential to recognise and block these phishing URLs in order to ensure online security and prevent users from becoming victims of cybercriminals [3]. Journal homepage: http://ijai.iaescore.com
  • 2. 2166 ❒ ISSN: 2252-8938 Modern methods for identifying phishing URLs frequently rely on manual analysis, heuristics, or rule-based algorithms, which find it difficult to keep up with the attackers’ constantly changing strategies [4]. Therefore, it is imperative to create reliable and automated techniques for efficiently detecting phishing URLs. Machine learning (ML) and deep learning techniques have recently demonstrated excellent results in a number of fields, and their use in phishing URL identification has enormous promise [5]. Since ML can learn automatically from the training dataset, numerous researchers have been looking at it for phishing detection. Features are taken from the URL in this. To achieve the highest level of accuracy, many of them created various feature extraction strategies for ML algorithms. With the help of ML algorithms like decision tree, support vector machine, random forest, and K-nearest neighbor, some researchers are able to detect phishing URLs with an accuracy of more than 90% [6]-[11]. The manual feature engineering of these models is their primary flaw. It implies that because the data were extracted based on manual interpretations, it’s possible that crucial features that machines might have picked up on were missed. After that point, more researchers proposed deep learning-based techniques due to the automatic feature extraction characteristic of deep learning. Phishing detection systems based on convolutional neural network (CNN) were given in methods [12]-[16], while some researchers [17]-[21] utilized recurrent neural network (RNN), multilayer perceptron (MLP), long short-term memory (LSTM), and its hybrid models. The majority of the approaches were more than 95% accurate. However, the system’s biggest flaw was its inability to recognize dynamic phishing URLs and use imbalanced data sets. In this work, we propose a novel method for phishing URL detection using a generative adversarial network (GAN) that comprises a variational autoencoder (VAE) as the generator and a transformer model with self-attention as the discriminator. While VAEs can capture latent representations and generate realistic URL samples, generator parts have demonstrated remarkable success in producing synthetic URLs. The transformer model uses self-attention mechanisms to distinguish between legitimate and phishing URLs. The primary goal of this research is to implement a reliable and accurate phishing URL detection system that can adapt to changing phishing tactics and offer a strong defence against them. We aim to increase the accuracy and effectiveness of phishing URL detection, enhancing online security for people and organisations, by using the capabilities of GAN with VAE and transformer. 2. METHOD The proposed methodology uses a GAN architecture with VAE as the generator and a transformer model with self-attention as the discriminator. The system aims to accurately detect phishing URLs and en- hance online security against attacks [18]. Figure 1 shows the workflow of the proposed model and the follow- ing paragraphs outline the key components and steps of the proposed method. Figure 1. Workflow of the proposed model Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
  • 3. Int J Artif Intell ISSN: 2252-8938 ❒ 2167 2.1. Dataset preparation Dataset preparation focuses on collecting and preparing the dataset. It comprises one million URLs, including both phishing URLs from different sources as shown in Table 1 [22]. The CSV file consists of a URL column and a label column with corresponding labels (0 for legitimate URLs and 1 for phishing URLs). The tokenizer from Keras is used to tokenize URLs at the character level. Sequences of tokens are generated using texts to sequences. pad sequences are used to ensure that all urls are of the same length. In order to clean, normalise, remove duplicates, and ensure a balanced distribution between phishing and legitimate URLs, various data preprocessing methods are used. The data set is split into 80% training and 20% testing sets using train test split [23]. Table 1. Dataset source details URL data set Source Legitimate URLs Majestic Million, Common Crawl, Kaggle, GitHub, and Alexa Phishing URLs PhishTank, Kaggle, Common Crawl, and OpenPhish 2.2. Variational autoencoder as generator With the aim of developing a latent representation of URLs and producing synthetic URLs that closely resemble genuine ones, the VAE is trained on the prepared dataset. By limiting the reconstruction loss, the generator’s performance is optimised, resulting in the generation of realistic URLs. In order to learn a low- dimensional representation (latent space) of the input URL sequences, the VAE model is built [24]. The maxi- mum sequence length of the URLs in the dataset determines the input shape of the VAE. This guarantees that each URL sequence will be processed with the same length. The encoder portion of the VAE is added as a thick layer. The encoder layer reduces the input URL sequences’ dimensionality to the designated latent dimension [25]. This is accomplished by projecting the input data into a lower-dimensional space via a non-linear trans- formation (activation function). The decoder portion of the VAE is added as a further dense layer. The latent representation created by the encoder is used by the decoder layer to reassemble the URL sequences. It returns the data’s original dimensionality [26]. The Adam optimizer, a well-liked option for training neural networks, is used to create the VAE model. The VAE’s aim is the binary cross-entropy loss function. It calculates the discrepancy between actual input URL sequences and predicted URL sequences. The VAE model is trained on the training URLs using a predetermined batch size and number of epochs. The model gains the ability to encode URLs into a lower- dimensional representation and decode them again to recreate the original sequences during training. The model is prompted to produce precise reconstructions of the input URLs by minimising the binary cross- entropy loss during training. In this method, the VAE model is trained to recognise the key traits and patterns in the input URL sequences. The remaining steps of the phishing URL detection system can make use of the efficient encoding and reconstruction of the URLs made possible by the reduced-dimensional latent space [27]. 2.3. Transformer model with self-attention as discriminator As the discriminator element of the GAN, the transformer model with self-attention is introduced. The transformer discriminator is pre-trained on a sizable corpus of text data, enabling it to extract URL context. To enable the discriminator to accurately identify between legitimate and phishing URLs, fine-tuning is carried out on the training set of URLs using a binary classification goal [28]. The discriminator model is constructed to classify URLs. The discriminator model consists of several layers. Figure 2 represents the structure of the discriminator. Figure 2. Structure of discriminator Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
  • 4. 2168 ❒ ISSN: 2252-8938 − Embedding layer: the input URLs are converted to detailed vector representations by this layer. It helps in capturing the semantic intent behind the URLs. − Multi head attention layer: the self-attention technique is used by this layer to extract key details from the embedded URL sequences. It enables the model to concentrate on different aspects of the input during processing. Figure 3 shows the multi head attention mechanism [29]. In actuality, we simultaneously compute the attention function on a collection of queries that are gathered into a matrix Q. In matrices K and V, the keys and values are also condensed together. Multiple simultaneous attention layers combine to create multi-head attention. The output matrix is calculated as follows: MultiHead(Q, K, V ) = Concat(h1, h2, ..., h8)WO (1) hi = Attention(QWQ i , KWK i , V WV i ) (2) where the projections are parameter matrices: WQ i ∈ IRdmodel×dk (3) WK i ∈ IRdmodel×dk (4) WV i ∈ IRdmodel×dv (5) WO ∈ IRhdv×dmodel (6) In our system, h=8 attention layers are there and they are arranged parallel. The value of dk = dv = dmodel/h = 64. − Normalisation layer: the outputs from the attention layer are normalised in this layer, guaranteeing stable training and enhanced performance. − Global average pooling 1D layer: by calculating the average value over the time dimension, this layer can produce output with smaller spatial dimensions and URLs with defined lengths. − Dense layer: for binary classification, a final dense layer with sigmoid activation is added. It generates a probability score that indicates whether a URL is likely to be real or phishing. The Adam optimizer, an effective method for neural network training, is used to create the discrimina- tor model. As the loss function, binary cross-entropy loss is employed. It calculates the difference between the actual labels and the expected probabilities. The model’s training performance in classification is also measured using the accuracy metric. On the training URLs and their respective labels for the number of epochs, with a 256 batch size, the discriminator is trained. Based on the retrieved attributes, the model develops the ability to distinguish between legal and phishing URLs during training. The Adam optimizer adjusts the weights of the model to enhance classification performance while minimising the binary cross-entropy loss. The accuracy statistic aids in tracking the model’s development throughout training. This method of training the discrimina- tor teaches it to recognise patterns and characteristics that distinguish between legitimate and phishing URLs. In order to focus on crucial information in the URL sequences, the model uses the self-attention mechanism. The model then generates predictions based on the learned representations. Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
  • 5. Int J Artif Intell ISSN: 2252-8938 ❒ 2169 Figure 3. Multiple simultaneous attention layers combine to create multi-head attention 2.4. Adversarial training The generator and discriminator are used to train the GAN using adversarial training procedures. To enhance their individual skills, the generator and discriminator are iteratively trained in a competitive manner. While the discriminator learns to distinguish between actual and fake URLs, the generator creates fake phishing URLs to trick it [30]. In this method, an adversarial training strategy is used to train a combined model that combines the VAE and discriminator models. − Freezing the discriminator weights: the discriminator’s trainable parameter is set to false prior to training the combined model. By doing this, the discriminator weights are guaranteed to remain constant during the adversarial training procedure. The VAE may concentrate on enhancing its reconstruction capability without being impacted by the discriminator’s feedback because the discriminator’s weights are frozen and are not modified. − Model stacking: the combined model is built by combining the VAE and discriminator models. The dis- criminator for categorization uses the output of the VAE as its input. This stacking enables the combined model to carry out both the discriminator’s classification task and the VAE’s reconstruction mission. − Compilation: the Adam optimizer, which is a well-liked optimizer for training neural networks, is used to create the merged model. To measure the difference between the genuine labels and the expected output of the discriminator, binary cross-entropy loss is utilised as the loss function. − Training: the training URLs and labels for the training URLs are used to train the combined model over a predetermined number of epochs and with a predetermined batch size. The integrated model is simultaneously tuned during training to correctly classify the URLs and rebuild them accurately (VAE objective). The system’s overall performance is enhanced by the model by taking use of the antagonistic interactions between the discriminator and VAE. The combined model is trained in such a way that the VAE learns to produce convincing URL recon- structions that can fool the discriminator. At the same time, the discriminator gains proficiency in accurately distinguishing between trustworthy and phishing URLs. This adversarial training procedure enhances the sys- tem’s capacity to identify and distinguish between malicious and trustworthy URLs. Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
  • 6. 2170 ❒ ISSN: 2252-8938 3. RESULTS AND DISCUSSION A dataset of one million URLs served as the basis for testing our suggested technique for detecting phishing URLs. The major objective was to improve online safety against phishing attacks by accurately dif- ferentiating between authentic and fraudulent URLs. The trial results show the efficiency of the suggested strategy, detecting phishing URLs with an astonishing accuracy of 97.75%. To assess the superiority of our model, we compared it with two other crucial deep learning-powered phishing detection models, as demon- strated in Table 2. A graphical representation of this is provided in Figure 4, with our model exhibiting clear dominance in every performance metric when compared to the other two. Table 2. Comparison with baseline models Models Used algorithm Dataset size Accuracy (%) Precision (%) Recall (%) F1-score (%) Aljofey et al. [31] Character level Convolution neural network 5,58,962 95.02 92.35 97.09 95.13 Ariyadasa et al. [32] Long-term recurrent Convolutional network & Graph convolutional network 1,36,096 96.42 96.40 96.44 96.42 Proposed system GAN using VAE and transformers 1 Million 97.75 97.91 97.62 97.77 Figure 4. Comparison with baseline models 4. CONCLUSION Using a transformer model with self-attention as the discriminator and a GAN with a VAE as the gen- erator, we developed a unique method for phishing URL detection in this paper. Through intensive testing on a sizable sample of one million URLs, we attained a phenomenal accuracy of 97.75%. This illustrates how our suggested methodology might enhance internet security by accurately identifying phishing URLs. Our research demonstrates the potential of GANs for phishing URL detection. By exploiting the VAE’s generation capabil- ities, we were able to create believable synthetic URLs that evaluated the discriminator’s capacity to generate precise classifications. By using the transformer model with self-attention, we were able to extract important elements from the text and recognise the characteristic features of phishing URLs. The adversarial training increased the model’s ability to discriminate, leading to better results than baseline models. Considering the great outcomes of our suggested approach, there are still many of areas that can be researched and improved. First, testing out different generator and discriminator designs might help the model perform better. We also want to develop a deep learning-based browser plugin for real-time phishing URL identification. REFERENCES [1] H. Abroshan, J. Devos, G. Poels, and E. Laermans, “COVID-19 and Phishing: Effects of Human Emotions, Behavior, and De- mographics on the Success of Phishing Attempts During the Pandemic,” IEEE Access, vol. 9, pp. 121916–121929, 2021, doi: 10.1109/ACCESS.2021.3109091. [2] Z. Zhang, H. Al Hamadi, E. Damiani, C. Y. Yeun, and F. Taher, “Explainable Artificial Intelligence Applications in Cyber Security: State-of-the-Art in Research,” IEEE Access, vol. 10, pp. 93104–93139, 2022, doi: 10.1109/ACCESS.2022.3204051. Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172
  • 7. Int J Artif Intell ISSN: 2252-8938 ❒ 2171 [3] P. López-Aguilar, C. Patsakis, and A. Solanas, “The Role of Extraversion in Phishing Victimisation: A System- atic Literature Review,” in 2022 APWG Symposium on Electronic Crime Research (eCrime), 2022, pp. 1–10, doi: 10.1109/eCrime57793.2022.10142078. [4] N. Q. Do, A. Selamat, O. Krejcar, E. Herrera-Viedma, and H. Fujita, “Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future Directions,” IEEE Access, vol. 10, pp. 36429–36463, 2022, doi: 10.1109/ACCESS.2022.3151903. [5] R. Zieni, L. Massari, and M. C. Calzarossa, “Phishing or Not Phishing? A Survey on the Detection of Phishing Websites,” IEEE Access, vol. 11, pp. 18499–18519, 2023, doi: 10.1109/ACCESS.2023.3247135. [6] A. Basit, M. Zafar, X. Liu, A. R. Javed, Z. Jalil, and K. Kifayat, “A comprehensive survey of AI-enabled phishing attacks detection techniques,” Telecommunication Systems, vol. 76, no. 1, pp. 139–154, 2021, doi: 10.1007/s11235-020-00733-2. [7] S. Anupam and A. K. Kar, “Phishing website detection using support vector machines and nature-inspired optimization algorithms,” Telecommunication Systems, vol. 76, no. 1, pp. 17–32, 2021, doi: 10.1007/s11235-020-00739-w. [8] H. Shirazi, S. R. Muramudalige, I. Ray, and A. P. Jayasumana, “Improved Phishing Detection Algorithms using Adversarial Autoencoder Synthesized Data,” in 2020 IEEE 45th Conference on Local Computer Networks (LCN), 2020, pp. 24–32, doi: 10.1109/LCN48667.2020.9314775. [9] M. S. M. Prince, A. Hasan, and F. M. Shah, “A New Ensemble Model for Phishing Detection Based on Hybrid Cumulative Feature Selection,” in 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), 2021, pp. 7–12, doi: 10.1109/ISCAIE51753.2021.9431782. [10] A. Butnaru, A. Mylonas, and N. Pitropakis, “Towards lightweight url-based phishing detection,” Future Internet, vol. 13, no. 6, pp. 1–15, 2021, doi: 10.3390/fi13060154. [11] Ö. Kasim, “Automatic detection of phishing pages with event-based request processing, deep-hybrid feature extraction and light gradient boosted machine model,” Telecommunication Systems, vol. 78, no. 1, pp. 103–115, 2021, doi: 10.1007/s11235-021- 00799-6. [12] M. Korkmaz, E. Kocyigit, O. K. Sahingoz, and B. Diri, “Phishing Web Page Detection Using N-gram Features Extracted From URLs,” in 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2021, pp. 1–6, doi: 10.1109/HORA52670.2021.9461378. [13] C. Opara, B. Wei, and Y. Chen, “HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Tech- niques on HTML Analysis,” in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8, doi: 10.1109/IJCNN48605.2020.9207707. [14] W. Wei, Q. Ke, J. Nowak, M. Korytkowski, R. Scherer, and M. Woźniak, “Accurate and fast URL phishing detector: A convolutional neural network approach,” Computer Networks, vol. 178, pp. 1–13, 2020, doi: 10.1016/j.comnet.2020.107275. [15] S. Abdelnabi, K. Krombholz, and M. Fritz, “VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1681–1698, doi: 10.1145/3372297.3417233. [16] S. Singh, M. P. Singh, and R. Pandey, “Phishing Detection from URLs Using Deep Learning Approach,” in 2020 5th International Conference on Computing, Communication and Security (ICCCS), 2020, pp. 1–4, doi: 10.1109/ICCCS49678.2020.9277459. [17] P. Yang, G. Zhao, and P. Zeng, “Phishing website detection based on multidimensional features driven by deep learning,” IEEE Access, vol. 7, pp. 15196–15209, 2019, doi: 10.1109/ACCESS.2019.2892066. [18] A. AlEroud and G. Karabatis, “Bypassing Detection of URL-based Phishing Attacks Using Generative Adversarial Deep Neu- ral Networks,” in Proceedings of the Sixth International Workshop on Security and Privacy Analytics, 2020, pp. 53–60, doi: 10.1145/3375708.3380315. [19] T. Feng and C. Yue, “Visualizing and Interpreting RNN Models in URL-based Phishing Detection,” in Proceedings of the 25th ACM Symposium on Access Control Models and Technologies, 2020, pp. 13–24, doi: 10.1145/3381991.3395602. [20] I. Saha, D. Sarma, R. J. Chakma, M. N. Alam, A. Sultana, and S. Hossain, “Phishing Attacks Detection using Deep Learning Approach,” in 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1180–1185, doi: 10.1109/ICSSIT48917.2020.9214132. [21] S. J. Bu and S. B. Cho, “Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing url detection,” Electronics, vol. 10, no. 12, pp. 1–16, 2021, doi: 10.3390/electronics10121492. [22] A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University - Computer and Information Sciences, vol. 35, no. 2, pp. 590–611, 2023, doi: 10.1016/j.jksuci.2023.01.004. [23] B. B. Gupta, K. Yadav, I. Razzak, K. Psannis, A. Castiglione, and X. Chang, “A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment,” Computer Communications, vol. 175, pp. 47–57, 2021, doi: 10.1016/j.comcom.2021.04.023. [24] M. K. Prabakaran, P. M. Sundaram, and A. D. Chandrasekar, “An enhanced deep learning-based phishing detection mechanism to effectively identify malicious URLs using variational autoencoders,” IET Information Security, vol. 17, no. 3, pp. 423–440, 2023, doi: 10.1049/ise2.12106. [25] L. Bergamin, T. Carraro, M. Polato, and F. Aiolli, “Novel Applications for VAE-based Anomaly Detection Systems,” in 2022 Inter- national Joint Conference on Neural Networks (IJCNN), 2022, vol. 2022-July, pp. 1–8, doi: 10.1109/IJCNN55064.2022.9892879. [26] G. Qi and H. Yu, “CMVAE: Causal Meta VAE for Unsupervised Meta-Learning,” in The Thirty-Seventh AAAI Conference on Artificial Intelligenc, pp. 9480–9488, 2023, doi: 10.1609/aaai.v37i8.26135. [27] L. Zhang, P. Zhang, L. Liu, and J. Tan, “Multiphish: Multi-Modal Features Fusion Networks for Phishing Detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3520–3524, doi: 10.1109/ICASSP39728.2021.9415016. [28] P. Maneriker, J. W. Stokes, E. G. Lazo, D. Carutasu, F. Tajaddodianfar, and A. Gururajan, “URLTran: Improving Phishing URL Detection Using Transformers,” in MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM), 2021, pp. 197–204, doi: 10.1109/MILCOM52596.2021.9653028. [29] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5999–6009. [30] S. Al-Ahmadi, A. Alotaibi, and O. Alsaleh, “PDGAN: Phishing Detection With Generative Adversarial Networks,” IEEE Access, vol. 10, pp. 42459–42468, 2022, doi: 10.1109/ACCESS.2022.3168235. Generative adversarial network-based phishing URL detection with variational ... (Jishnu Kaitholikkal Sasi)
  • 8. 2172 ❒ ISSN: 2252-8938 [31] A. Aljofey, Q. Jiang, Q. Qu, M. Huang, and J. P. Niyigena, “An effective phishing detection model based on character level convo- lutional neural network from URL,” Electronics, vol. 9, no. 9, pp. 1–24, 2020, doi: 10.3390/electronics9091514. [32] S. Ariyadasa, S. Fernando, and S. Fernando, “Combining Long-Term Recurrent Convolutional and Graph Convolutional Net- works to Detect Phishing Sites Using URL and HTML,” IEEE Access, vol. 10, pp. 82355–82375, 2022, doi: 10.1109/AC- CESS.2022.3196018. BIOGRAPHIES OF AUTHORS Jishnu Kaitholikkal Sasi is pursuing a Ph.D. in Computer Science from the SRM Institute of Science and Technology. He completed his B.Tech. in Computer Science from KMCT College of Engineering in 2017 and received an M.Tech. degree in Computer Science from Government En- gineering College, Wayanad, in 2021. He has good programming skills and is proficient in software engineering, deep learning, and blockchain. His research mainly focuses on detecting phishing URLs using deep learning. He can be contacted at email: js2963@srmist.edu.in. Arthi Balakrishnan holds a Ph.D. degree in the field of Computer Science and Engineer- ing from Anna University. She has 15+ years of experience in teaching. Her areas of interest include artificial intelligence, machine learning, software engineering, and Internet of Things. She has pub- lished several articles in various reputed national and international, such as Scopus and SCI journals. She has also authored book chapters in CRC and Springer Publications. She has presented papers in various national and international conferences and attended many workshops, seminars, and faculty development programs to be in track with the changing technology and teaching methodology. She is a member of various scientific and professional bodies. She has been awarded the IET Inspiring Young Teacher Award for the year 2016-2017 for the IET Chennai. She can be contacted at email: arthib@srmist.edu.in. Int J Artif Intell, Vol. 13, No. 2, June 2024: 2165–2172