International Journal of Electrical and Computer Engineering (IJECE)
Vol. 14, No. 1, February 2024, pp. 830~840
ISSN: 2088-8708, DOI: 10.11591/ijece.v14i1.pp830-840
Journal homepage: http://ijece.iaescore.com
Enhanced transformer long short-term memory framework for
datastream prediction
Nada Adel Dief, Mofreh Mohamed Salem, Asmaa Hamdy Rabie, Ali Ibrahim El-Desouky
Department of Computer and Control Systems Engineering, Faculty of Engineering, Mansoura University, Mansoura, Egypt
Article Info ABSTRACT
Article history:
Received Aug 15, 2023
Revised Sep 26, 2023
Accepted Oct 9, 2023
In machine learning, datastream prediction is a challenging issue,
particularly when dealing with enormous amounts of continuous data. The
dynamic nature of data makes it difficult for traditional models to handle and
sustain real-time prediction accuracy. This research uses a multi-processor
long short-term memory (MPLSTM) architecture to present a unique
framework for datastream regression. By employing several central
processing units (CPUs) to divide the datastream into multiple parallel
chunks, the MPLSTM framework illustrates the intrinsic parallelism of long
short-term memory (LSTM) networks. The MPLSTM framework ensures
accurate predictions by skillfully learning and adapting to changing data
distributions. Extensive experimental assessments on real-world datasets
have demonstrated the clear superiority of the MPLSTM architecture over
previous methods. This study uses the transformer, the most recent deep
learning breakthrough technology, to demonstrate how well it can handle
challenging tasks and emphasizes its critical role as a cutting-edge approach
to raising the bar for machine learning.
Keywords:
Datastream
Long short-term memory
Machine learning
Multiprocessing pool
Parallel processing
Prediction accuracy
Transformer
This is an open access article under the CC BY-SA license.
Corresponding Author:
Nada Adel Dief
Department of Computer and Control Systems Engineering, Faculty of Engineering, Mansoura University
Mansoura, 35511, Egypt
Email: nadadief@mans.edu.eg
1. INTRODUCTION
In the era of big data, traditional machine learning methods can be computationally burdensome
and complex, making them unsuitable for processing large-scale datasets. They often struggle to produce
accurate predictions in the face of the challenges posed by big data, including its sheer volume, complexity,
and high-dimensional nature [1]. On the other hand, data-driven methods utilizing deep learning have
attracted interest due to their capacity to perform statistical analysis and information extraction automatically
and effectively on large-scale, multi-source, and high-dimensional data, thereby overcoming the limitations
of traditional prediction methods [2].
Recurrent neural networks (RNNs) are a kind of neural network that is particularly good at handling
sequential data. In contrast to conventional feedforward neural networks, RNNs have feedback connections,
which enable them to keep an internal memory of prior inputs. This memory enables RNNs to effectively
capture temporal dependencies and patterns in sequential data. However, traditional RNNs experience the
“vanishing gradient” problem, where the gradient signal weakens with time and makes it difficult to
adequately capture long-term dependencies. To overcome this restriction, variants like long short-term
memory (LSTM) and gated recurrent unit (GRU) were developed. Incorporating gating mechanisms that
selectively remember or forget information, these models are better able to capture and spread pertinent
information over longer sequences [3]. Additionally, in neural networks (NNs), the pre-assignment of
parameters defines the network's topology and has an impact on how computationally intensive training
and prediction are. Therefore, optimizing the parameters is crucial for achieving excellent performance.
However, like other deep learning (DL) networks [4], LSTM also faces the challenge of parameter
selection, which often requires hand-engineered adjustments. Manual parameter adjustment is difficult,
particularly when dealing with vast amounts of data and very deep network structures. To address this
issue, a grid search (GS) is employed to look for the ideal settings for multi-processor long short-term
memory (MPLSTM), leading to predicting the datastream flow. This approach aims to build a suitable
model structure and increase the MPLSTM's prediction accuracy. A multi-processor LSTM framework for
real-time data stream processing is the main goal of this study. It conducts a comprehensive analysis of
MPLSTM using a real-life dataset, offering valuable insights into monitoring the parallel approach.
2. THE PROPOSED DATASTREAM MULTIPROCESSING LSTM FRAMEWORK
In this section, a framework for real datastream analysis is presented that harnesses the strengths of
MPLSTM along with other techniques to attain superior accuracy in real-time data processing. Figure 1
demonstrates the framework’s overall architecture and gives a visual representation of the intricate details
that underlie its operational procedures. The framework is made up of several parts, including an output
layer, a hidden layer, and a layer for data input.
Figure 1. The details of the proposed MPLSTM framework
2.1. Data acquisition and preprocessing
The pre-processing stage is essential for getting the data ready for input into MPLSTM. This phase
involves several steps, including data cleaning, normalization, splitting, and reshaping [5]. Firstly, data
cleaning is performed to remove any missing or inconsistent data that can adversely affect the model's
performance. Secondly, the input data is normalized to bring it within a specified range and avoid bias in the
model's performance, then split and reshaped into a form that is compatible with the LSTM unit [6].
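As an illustration only, the following Python sketch shows one way these pre-processing steps could look with NumPy and scikit-learn; the dataset shape, scaling range, and split ratio are assumptions, not values taken from the paper.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical univariate stream: 1,000 windows of 140 time steps each.
X = np.random.rand(1000, 140)
y = np.random.randint(0, 5, size=1000)

# Normalization: scale values into [0, 1] so no range dominates training.
X = MinMaxScaler().fit_transform(X)

# Splitting, then reshaping to (samples, time steps, features) for the LSTM unit.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))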
2.2. The learning model
After splitting the dataset into train and test, the training set is divided into chunks and processed in
parallel using the multiprocessing pool. The pool manages a group of worker processes, automatically
assigning tasks to available workers and handling the communication between the main process and the
worker processes. After that, the proposed MPLSTM framework is trained with various hyperparameters
adjusted through multiple experiments until reaching a stable state, optimizing the weights with a grid search
algorithm for best performance.
2.2.1. Parallelization using multiprocessing pool
Parallelizing LSTM-based models using multiprocessing [7] enables faster processing of input
sequences, efficient resource utilization, scalability, and flexibility in model design and optimization. It can
be particularly useful in scenarios where large sequences or computationally intensive models need to be
processed within a reasonable time frame. To apply multiprocessing, there are some steps [8] that must be
followed: i) split the data into smaller chunks that can be processed independently, ii) create a function that
will be executed in parallel by multiple processes, iii) this function will take a data chunk as input and
perform LSTM processing on that chunk, iv) inside the function, create an instance of the LSTM model and
train or predict on the input chunk, v) set up a multiprocessing pool, vi) the pool manages a group of worker
processes that will execute the parallel processing function, vii) specify the number of worker processes to
utilize, typically based on the available hardware resources, viii) collect the results from the parallel
processes, and ix) however, it is important to note that the level of parallelism achievable depends on factors
such as the number of available CPU cores, memory capacity, and the size of the input sequence. Parallel
execution can be increased potentially by having a higher number of CPU cores, allowing for the execution
of more processes in parallel. Similarly, the presence of sufficient memory capacity is crucial to
accommodate the running of data and processes in parallel without being constrained by memory-related
issues.
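A minimal sketch of this workflow using Python's multiprocessing.Pool is given below; the chunk contents and the worker's model-building step are placeholders rather than the framework's actual implementation.

import numpy as np
from multiprocessing import Pool, cpu_count

def process_chunk(chunk):
    # Worker function: in the real framework an LSTM instance would be created
    # here and trained (or used for prediction) on this chunk.
    X_chunk, y_chunk = chunk
    return X_chunk.shape[0]  # dummy result standing in for metrics/predictions

if __name__ == "__main__":
    X = np.random.rand(1000, 140, 1)           # hypothetical training data
    y = np.random.randint(0, 5, size=1000)
    n_workers = cpu_count()                     # one worker per available core
    # Split the training data into independent chunks, one per worker.
    chunks = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    with Pool(processes=n_workers) as pool:
        results = pool.map(process_chunk, chunks)  # chunks processed in parallel
    print(results)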
2.2.2. Dropout layer
To increase the network's speed and robustness, dropout regularization has been incorporated [9]. Dropout is
a regularization method frequently employed in neural networks that randomly deactivates a fraction of the
neurons in the previous layer during each training iteration. This dropout of neurons helps prevent overfitting
[10] by reducing the reliance of the network on specific neurons and encourages the learning of more robust
and generalizable representations. During inference, the dropout layer is typically turned off, and the full
network is used for making predictions. By incorporating dropout layers, the network becomes more resilient
to overfitting and can improve its generalization performance on unseen data [11], [12].
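The behavior described above can be seen in a small Keras example; the dropout rate of 0.2 is an assumed value, not one taken from the paper.

import tensorflow as tf

dropout = tf.keras.layers.Dropout(rate=0.2)   # assumed rate
x = tf.ones((1, 8))

# Training: roughly 20% of activations are zeroed and the rest rescaled.
print(dropout(x, training=True).numpy())
# Inference: dropout is disabled and the input passes through unchanged.
print(dropout(x, training=False).numpy())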
2.2.3. Dense layer
A fully connected layer is a fundamental component of a neural network. It consists of multiple
nodes, or neurons, each of which is connected to every neuron in the preceding layer. In a dense layer, each
neuron's output is determined by applying an activation function to the weighted sum of its inputs from the
previous layer [13]. To maximize the network's performance on the specified task, these weights and biases
are learned throughout the training phase [14].
The model consists of several dense layers that are fully connected to all the activations in the
former layer. These dense layers combine the complicated feature maps to produce a flattened feature vector.
The instances are then classified using the softmax [15] output probabilities produced by the last dense layer.
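As a sketch of the idea, the snippet below builds a hidden dense layer followed by a softmax output layer in Keras; the layer sizes are illustrative and not taken from the paper.

import tensorflow as tf

# A dense (fully connected) layer computes activation(W·x + b).
hidden = tf.keras.layers.Dense(units=10, activation="relu")
x = tf.random.normal((2, 64))                  # batch of 2 flattened feature vectors
h = hidden(x)                                   # shape (2, 10)

# The last dense layer uses softmax to turn scores into class probabilities.
out = tf.keras.layers.Dense(units=5, activation="softmax")(h)
print(out.shape, tf.reduce_sum(out, axis=-1).numpy())  # each row sums to 1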
2.2.4. Adam optimizer
This section highlights the importance of parameter optimization in improving the model's
performance. MPLSTM was trained using the Adam optimizer, a widely used optimization technique in deep
learning [16]. It combines the advantages of the adaptive gradient algorithm (AdaGrad) and root mean square
propagation (RMSProp) by adapting the learning rate for each parameter individually. This adaptive learning
rate helps achieve faster convergence and improved performance during training. The Adam optimizer is
used to minimize the categorical cross-entropy loss function with a learning rate of 10^-4. By updating its
weights and biases with the Adam optimizer, MPLSTM reduces the loss function and improves its overall
accuracy.
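A minimal Keras sketch of an LSTM branch compiled with Adam is shown below. The layer sizes and input shape are assumptions; the learning rate follows the 10^-4 value mentioned above, and the loss follows the sparse categorical cross-entropy listed in Table 2.

import tensorflow as tf

n_timesteps, n_features, n_classes = 140, 1, 5   # hypothetical shapes

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(n_timesteps, n_features)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)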
2.3. Prediction
In the prediction phase, the input data is fed forward through the network, and the softmax function is
applied in the output layer to create a probability distribution over the classes [17]. The anticipated class label
is normally determined by the class with the highest probability. Softmax makes sure that the projected
probabilities are restricted between 0 and 1 and add up to 1. Because of this, it is appropriate for multi-class
classification problems in which each instance belongs to a single class.
P_i = \frac{\exp(Z_i)}{\sum_{j=1}^{n} \exp(Z_j)} (1)
where n is the total number of classes, 𝑍𝑖 is the raw output value for class 𝑖. By applying softmax, the neural
network can provide a probability-based prediction, allowing for decision-making based on the highest
probability class.
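Equation (1) can be computed directly; the small NumPy sketch below uses hypothetical raw outputs Z_i for illustration.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # hypothetical raw outputs Z_i
p = softmax(z)
print(p, p.sum())                # probabilities in [0, 1] that sum to 1
print(np.argmax(p))              # predicted class = highest probability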
2.4. Evaluation
Datastreams often exhibit regular changes in the class distribution of incoming instances. The
evaluation metrics must therefore provide a comprehensive assessment of the model's performance, considering
the evolving nature of the data stream and allowing for timely adaptation and monitoring. To evaluate the
results, MPLSTM uses the following evaluation metrics. Classification accuracy compares the predictions of
MPLSTM with the actual target values from the dataset [18].
Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} (2)
where the number of correct predictions is the count of instances for which MPLSTM correctly predicts the
target value, and the total number of predictions is the number of instances for which MPLSTM made
predictions. The result is a value between 0 and 1, representing the proportion of correct predictions made
by MPLSTM.
Then, three error metrics, mean square error (MSE), root mean squared error (RMSE), and mean
absolute error (MAE), are used to assess the model's performance. These measures evaluate different facets
of the model's precision and predictive ability [19].
MSE = \frac{1}{n}\sum_{t=1}^{n}(y_t - \bar{y}_t)^2 (3)

RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \bar{y}_t)^2} (4)

MAE = \frac{1}{n}\sum_{t=1}^{n}|y_t - \bar{y}_t| (5)
where n is the number of samples and y_t and \bar{y}_t are the predicted and actual values, respectively.
RMSE measures the average magnitude of the prediction errors by taking the square root of the mean squared
difference between the predicted values y_t and the actual values \bar{y}_t; it indicates how accurately the
model predicts the target variable. MAE measures the average absolute difference between y_t and \bar{y}_t.
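For reference, equations (3)-(5) translate directly into NumPy; the sample values below are illustrative only.

import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_pred, y_true):
    return np.sqrt(mse(y_pred, y_true))

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

y_pred = np.array([0.9, 0.2, 1.8, 1.1])   # hypothetical predicted values
y_true = np.array([1.0, 0.0, 2.0, 1.0])   # hypothetical actual values
print(mse(y_pred, y_true), rmse(y_pred, y_true), mae(y_pred, y_true))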
3. LSTM ENHANCEMENT
Inspired by the transformer model's innovations [20], we enhance LSTM by incorporating
transformer principles. This fusion includes self-attention and cross-attention mechanisms [21] similar to
transformers, improving LSTM's ability to capture complex data dependencies, especially in large datasets.
The resulting TransLSTM architecture combines LSTM and transformer strengths, making it adaptable and
powerful for real-world applications and predictions. Figure 2 illustrates TransLSTM: input encoding
converts tokens to continuous vectors, positional encodings provide context and positional information,
transformer encoder blocks process sequences with multi-head self-attention and feedforward networks,
LSTM Integration captures sequential dependencies, attention mechanism combines information from both
sources, and the output layer produces final predictions.
Figure 2. TransLSTM architecture
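The following Keras sketch illustrates the kind of stacking Figure 2 describes (input encoding, positional encoding, one transformer encoder block, an LSTM layer, and a softmax output). All dimensions and layer counts are assumptions for illustration, not the paper's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers

def build_translstm(seq_len=140, n_features=1, n_classes=5,
                    d_model=64, num_heads=4, ff_dim=128):
    inputs = layers.Input(shape=(seq_len, n_features))
    # Input encoding: project raw values into a d_model-dimensional space.
    x = layers.Dense(d_model)(inputs)
    # Positional encoding: learned position embeddings added to the encoded inputs.
    positions = tf.range(start=0, limit=seq_len, delta=1)
    x = x + layers.Embedding(input_dim=seq_len, output_dim=d_model)(positions)
    # One transformer encoder block: multi-head self-attention plus feedforward.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    x = layers.LayerNormalization()(x + ff)
    # LSTM integration: capture sequential dependencies in the encoded sequence.
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_translstm()
model.summary()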
4. RESULTS AND DISCUSSION
This section outlines the comparative study conducted to assess the performance of MPLSTM. The
experimental procedure employs statistical analysis to evaluate the results obtained across all datasets,
comparing MPLSTM with several state-of-the-art algorithms for data stream classification. The results of this
study provide important new information about the performance of the proposed MPLSTM framework and
its competitive position in the field of data stream classification techniques.
4.1. Dataset
A total of 29 different time-series datasets were used in this study and came from the UCR
repository, which is accessible to the public [22]. Stream clustering [23], anomaly detection [24], and data
stream density estimation are just a few of the applications for which these datasets have been used in
research in the past. Each dataset comprises of instances of a one-dimensional time series with a built-in grid
structure.
The IMDB dataset [25], introduced by Maas et al. [26], is a prominent benchmark for sentiment
classification. It comprises 25,000 reviews in both the training and test sets, each limited to 30 reviews per
movie for diversity. This balanced dataset contains an equal number of positive and negative reviews,
establishing a 50% accuracy baseline if predictions were random.
4.2. Case study 1
The study encompassed a comprehensive exploration of various established techniques, aiming to
encompass all algorithm families proposed in the literature for the given problem. Table 1 provides an
overview of the evaluated classifiers, organized by their respective families, and includes the abbreviations
used throughout this paper [27], [28]. The results obtained from the conducted experiments are presented and
discussed. Additionally, the processing time on each dataset is analyzed, considering the significance of
speed-up in a data streaming scenario.
Table 1. Utilized models for case study 1
Classifier Abbreviation Family
Naive Bayes NaivBy Bayesian classifiers
Adaptive size Hoeffding tree AdptSHOFT Decision tree
Stochastic gradient-descent StoGrdD Function classifiers
Single classifier drift SnglCDrft Drift classifiers
Leveraging bagging LvrgBag Meta classifiers bagging
Adaptive random forest AdptRnF Meta classifiers bagging
Boosting using ADWIN BoAdwin Meta classifiers boosting
Multi-layer perceptron MLPrecept Neural networks
Hyperparameter selection typically involves using rule-of-thumb parameters or proven combinations
from previous studies. However, a systematic approach like grid search (GS) [29] is employed for meticulous
hyperparameter selection. Grid search is favored due to its simplicity, parallelizability, and effectiveness in
low-dimensional spaces. It entails discretizing hyperparameter value ranges and systematically testing all
possible combinations. This approach explores diverse model configurations. Before training the final
models, a validation run optimizes hyperparameters based on accuracy assessments. The training process
ends when the maximum epoch limit is reached. MPLSTM configuration details are summarized in Table 2.
In this paper, the Adam optimizer is chosen post-validation for its computational efficiency and slightly
superior test results. A batch size of 32 is used for all models, and the sparse categorical cross entropy as a
loss function [30] is employed. This loss function calculates the negative logarithm of the predicted
probability for the true class index when applied to class indices, showing the model's level of assurance in
the accuracy of its class prediction.
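Conceptually, the grid search amounts to enumerating every combination in a discretized hyperparameter grid and keeping the best-scoring one. The sketch below is a generic illustration: the grid values and the validate() scoring stub are placeholders, not the search space actually used for MPLSTM.

from itertools import product

# Hypothetical, discretized hyperparameter grid.
grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [32, 64], "dropout": [0.2, 0.5]}

def validate(learning_rate, batch_size, dropout):
    # Stand-in for a validation run: train the model with these settings
    # and return its validation accuracy.
    return 1.0 - learning_rate - 0.01 * dropout

best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: validate(**params),
)
print("best configuration:", best)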
Table 2. Hyperparameters used in tuning MPLSTM framework
Network Parameter Configuration
Dense 10
Epochs number 200
Optimization function ADAM optimizer
Size of a batch 32
Learning rate 0.001
Activation function Softmax
Loss function Sparse categorical cross entropy
SparseCategoricalCrossEntropy = -\sum_{i=1}^{n} y_i \log(\hat{y}_i) (6)

where n represents the number of classes, y_i represents the true label or target value of the i-th class, and
\hat{y}_i represents the predicted probability for the corresponding class.
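A small numeric illustration of equation (6) for integer class labels follows; the probabilities are made up for the example.

import numpy as np

def sparse_categorical_crossentropy(y_true_idx, y_pred_probs):
    # Negative log of the probability the model assigned to the true class index.
    return -np.mean(np.log(y_pred_probs[np.arange(len(y_true_idx)), y_true_idx]))

y_true = np.array([2, 0])                    # integer class indices
y_pred = np.array([[0.1, 0.2, 0.7],          # confident and correct
                   [0.5, 0.3, 0.2]])         # less confident
print(sparse_categorical_crossentropy(y_true, y_pred))  # (-ln 0.7 - ln 0.5)/2 ≈ 0.525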
4.3. Performance evaluation
Batch sizes above 30 exhibit stable accuracy and processing times regardless of the number of
batches, offering flexibility in parameter selection. However, batch sizes below 30 significantly degrade
performance, hindering model adaptability to evolving data streams. Very small batch sizes overly focus on
individual examples, preventing learning of overall data distribution changes. MPLSTM achieves high
accuracy across various datasets, showcasing LSTM's suitability for time-series data streaming.
Convergence of training and validation loss lines during model training is a positive indicator,
signifying learning progress. MPLSTM reduces processing time significantly through parallel processing,
enhancing accuracy and predictive capabilities. The trade-off with increased computational time should be
considered based on application requirements. Figure 3 demonstrates parallel processing consistently
outperforming sequential processing across 29 datasets, ensuring MPLSTM's effectiveness.
Figure 3. A comparison between sequential and parallel execution processing times
The FacesUCR dataset exhibits a speedup of 2 times when processed in parallel, indicating a
significant improvement in processing time compared to sequential processing. Similarly, for the Pendigits
dataset, the parallel model achieves a speedup of 1.7, further demonstrating the efficiency of parallel processing
over the sequential approach in this case. Several other datasets, such as PhalangesOutlinesCorrect and
TwoPatterns, also exhibit notable speedups larger than 1.5 times when processed in parallel. These findings
further emphasize the effectiveness of MPLSTM in reducing processing time. The observed speedup across
multiple datasets as shown in Figure 4 underscores the model's ability to leverage parallelism efficiently,
resulting in faster dataset processing.
By harnessing parallel processing, the MPLSTM demonstrates its capability to significantly improve
performance and expedite data analysis tasks. Upon closer analysis, the processing time of individual
datasets, such as the ECG5000, demonstrates notable improvements in processing time as shown in Figure 5,
although the magnitude of the speedup may not be extremely high. However, when examining the Pendigits
dataset, with its larger size and increased complexity, the benefits of parallel processing become increasingly
pronounced. Consequently, the speedup achieved becomes substantially larger.
Figure 4. Dataset speedup curve when applying the MPLSTM framework
Figure 5. A comparison of the processing time of sequential and parallel execution of two datasets ECG5000
and Pendigits
Figure 6 displays the learning curves, which depict the accuracy improvement across each epoch for
two different datasets ECG5000 and Pendigits. These curves visually demonstrate how the model's accuracy
increases as the training progresses. The learning curves for the two datasets show how well the model was
trained and how well it could learn from the data. The consistent improvement in accuracy over the course of
the epochs implies that the model is not just remembering the training data but also generalizing well to new
data. This is encouraging for the model's ability to predict outcomes using fresh data.
Similarly, Figure 7 illustrates the decrease in loss across each epoch for the same mentioned datasets.
These learning curves provide crucial insights into the model's performance and offer valuable guidance for
optimizing its architecture and training procedure. By addressing overfitting and considering early stopping,
the model's accuracy can be further improved while maintaining good generalization capabilities.
In Table 3, the proposed MPLSTM framework's effectiveness was assessed using MSE, RMSE,
and MAE as evaluation metrics. Low values for these metrics are desirable, as they indicate better
performance. In this study, the framework yielded promising results with low values: for example,
MSE=0.237 for ECG5000, RMSE=0.583 for PhalangesOutlinesCorrect, and MAE=0.074 for Pendigits. This
implies that the predictions made by the MPLSTM model were close to the actual values. The model
performed well because it was able to learn the patterns and relationships in the data, which led to accurate
predictions and shows that the MPLSTM framework is an effective way to address this problem.
Table 4 showcases the performance evaluation results for MPLSTM on the UCR datasets.
It provides a comprehensive overview of the accuracy achieved by the framework in predicting the target
variable. The table offers a detailed breakdown of the accuracy scores across different metrics or
experimental configurations, allowing for a comprehensive analysis of the framework's performance.
Researchers and practitioners can refer to this table to assess the effectiveness and reliability of MPLSTM in
accurately predicting the target variable on the UCR dataset.
Figure 6. The accuracy curves of training and validation sets in two datasets ECG5000 and Pendigits
Figure 7. The loss curves of training and validation sets of two datasets ECG5000 and Pendigits
Table 3. Performance of MPLSTM in terms of MSE, RMSE, and MAE
Dataset MSE RMSE MAE
Wafer 0.449 0.670 0.2247
Pendigits 0.390 0.625 0.074
ECG5000 0.237 0.478 0.113
HandOutlines 0.361 0.601 0.361
PhalangesOutlinesCorrect 0.340 0.583 0.340
ChlorineConcentration 1.11 1.0536 0.340
Table 4. Accuracy of the top 7 classifiers for the UCR datasets compared with the MPLSTM framework
Dataset Proposed MPLSTM AdptRnF MLPrecept NaivBy SnglCDrft AdptSHOFT LvrgBag BoAdwin StoGrdD
Wafer 0.100 0.982 0.991 0.194 0.192 0.356 0.963 0.960 0.542
Pendigits 0.976 0.950 0.938 0.824 0.784 0.850 0.867 0.909 0.800
ECG5000 0.941 0.856 0.877 0.750 0.772 0.750 0.752 0.843 0.833
ElectricDevices 0.85 0.526 0.526 0.456 0.456 0.456 0.468 0.457 0.194
HandOutlines 0.638 0.720 0.634 0.533 0.533 0.533 0.530 -0.084 0.475
PhalangesOutlinesCorrect 0.659 0.377 0.060 0.134 0.134 0.133 0.245 0.277 0.072
ChlorineConcentration 0.533 0.149 0.082 0.122 0.122 0.001 0.063 0.001 0.099
4.4. Case study 2
In this study, LSTM is integrated with the Transformer model to create TransLSTM, a novel
architecture. TransLSTM leverages the Transformer's success in handling sequential data and capturing long-
range dependencies. This fusion enhances LSTM's ability to model complex relationships and temporal
dependencies in sequential data by incorporating self-attention and cross-attention mechanisms from the
Transformer. The investigation demonstrates how TransLSTM can address LSTM's limitations, potentially
leading to more accurate and efficient predictions. This case study highlights the innovative potential of
combining diverse neural architectures for enhanced predictive capabilities.
4.5. TransLSTM evaluation
The training history curves in the provided case study offer insights into the performance of two
different models, LSTM and TransLSTM, across multiple epochs. In Figure 8, the first and the third curves
represent training loss for LSTM and TransLSTM, while the second and fourth curves represent validation
loss for LSTM and TransLSTM, respectively. These curves depict the evolution of training loss over epochs,
showing a decreasing trend, and indicating learning from the training data. TransLSTM consistently achieves
lower training loss and outperforms LSTM in validation loss, indicating better generalization to new data.
Similarly, the other figure displays training and validation accuracy curves, with the first and the third curves
representing training accuracy for LSTM and TransLSTM, and the second and the fourth curves representing
validation accuracy. Both models exhibit an increasing trend in training accuracy, demonstrating efficient
learning from the training data as well as the capacity to generalize to new, untried data. TransLSTM
achieves higher training and validation accuracy, highlighting its superior data modeling capabilities.
Figure 8. Comparison between LSTM and TransLSTM training and validation loss and accuracy
5. CONCLUSION
In conclusion, this paper presents a novel framework for datastream regression, referred to as
MPLSTM. The proposed framework effectively addresses the challenges associated with handling
continuous and large-scale data in real-time prediction scenarios. By leveraging the inherent parallelism of
LSTM networks, MPLSTM achieves a remarkable balance between high prediction accuracy and
computational efficiency. Experimental evaluations, conducted on real-world datasets including the UCR
dataset, validate the superior performance of MPLSTM compared to traditional regression models. The
framework's ability to capture temporal dependencies and long-term patterns in streaming data is
demonstrated through accurate predictions, as evidenced by accuracy measures and loss calculations.
MPLSTM emerges as a promising approach for datastream prediction, showcasing improved performance
and outperforming existing results in terms of accuracy and loss.
REFERENCES
[1] S. Bharany et al., “A comprehensive review on big data challenges,” in 2023 International Conference on Business Analytics for
Technology and Security (ICBATS), Mar. 2023, pp. 1–7, doi: 10.1109/ICBATS57792.2023.10111375.
[2] S. Homayoun and M. Ahmadzadeh, “A review on data stream classification approaches,” Journal of Advanced Computer Science
and Technology, vol. 5, no. 1, Feb. 2016, doi: 10.14419/jacst.v5i1.5225.
[3] S. Ray, “A quick review of machine learning algorithms,” in 2019 International Conference on Machine Learning, Big Data,
Cloud and Parallel Computing (COMITCon), Feb. 2019, pp. 35–39, doi: 10.1109/COMITCon.2019.8862451.
[4] F. Karim, S. Majumdar, H. Darabi, and S. Chen, “LSTM fully convolutional networks for time series classification,” IEEE
Access, vol. 6, pp. 1662–1669, 2018, doi: 10.1109/ACCESS.2017.2779939.
[5] S. Smyl and K. Kuber, “Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural
networks,” 36th International Symposium on Forecasting, 2016.
[6] I. O. Muraina, “Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data
analysts,” in 7th International Mardin Artuklu Scientific Researches Conference, 2022, pp. 496–504.
[7] J. Hunt, “Multiprocessing,” in Advanced Guide to Python 3 Programming, Springer International Publishing, 2019, pp. 363–376.
[8] Z. A. Aziz, D. Naseradeen Abdulqader, A. B. Sallow, and H. Khalid Omer, “Python parallel processing and multiprocessing: a
review,” Academic Journal of Nawroz University, vol. 10, no. 3, pp. 345–354, Aug. 2021, doi: 10.25007/ajnu.v10n3a1145.
[9] X. Liang et al., “R-Drop: regularized dropout for neural networks,” Advances in Neural Information Processing Systems, vol. 13,
pp. 10890–10905, 2021.
[10] X. Ying, “An overview of overfitting and its solutions,” Journal of Physics: Conference Series, vol. 1168, no. 2, Feb. 2019, doi:
10.1088/1742-6596/1168/2/022022.
[11] N. Watt and M. C. du Plessis, “Dropout for recurrent neural networks,” in Proceedings of the International Neural Networks
Society, Springer International Publishing, 2020, pp. 38–47.
[12] A. Zunino, S. A. Bargal, P. Morerio, J. Zhang, S. Sclaroff, and V. Murino, “Excitation dropout: encouraging plasticity in deep
neural networks,” International Journal of Computer Vision, vol. 129, no. 4, pp. 1139–1152, Jan. 2021, doi: 10.1007/s11263-020-
01422-y.
[13] D. Jha, A. Yazidi, M. A. Riegler, D. Johansen, H. D. Johansen, and P. Halvorsen, “LightLayers: parameter efficient dense and
convolutional layers for image classification,” in Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12606, Springer International Publishing, 2021, pp. 285–296.
[14] P. Lara-Benítez, M. Carranza-García, D. Gutiérrez-Avilés, and J. C. Riquelme, “Data streams classification using deep learning
under different speeds and drifts,” Logic Journal of the IGPL, vol. 31, no. 4, pp. 688–700, Jul. 2023, doi: 10.1093/jigpal/jzac033.
[15] O. Du, Y. Zhang, X. Li, J. Zhu, T. Zheng, and Y. Li, “Multi-view heterogeneous network embedding,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13369
LNAI, Springer International Publishing, 2022, pp. 3–15.
[16] Z. Zhang, “Improved Adam optimizer for deep neural networks,” in 2018 IEEE/ACM 26th International Symposium on Quality of
Service (IWQoS), Jun. 2018, pp. 1–2, doi: 10.1109/IWQoS.2018.8624183.
[17] Y. Gao, W. Liu, and F. Lombardi, “Design and implementation of an approximate softmax layer for deep neural networks,” 2020
IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 2020, pp. 1-5, doi:
10.1109/iscas45731.2020.9180870.
[18] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Machine Learning, vol. 90, no. 3,
pp. 317–346, Oct. 2013, doi: 10.1007/s10994-012-5320-9.
[19] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? -Arguments against avoiding RMSE
in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.
[20] A. Vaswani et al., “Attention is all you need,” arXiv:1706.03762, Jun. 2017.
[21] Z. Huang, P. Xu, D. Liang, A. Mishra, and B. Xiang, “TRANS-BLSTM: transformer with bidirectional LSTM for language
understanding,” arXiv:2003.07000, Mar. 2020.
[22] Y. Chen et al., “The UCR time series classification archive,” NSF, Jul. 2015, https://www.cs.ucr.edu/~eamonn/time_series_data/
(accessed Jul. 13, 2023)
[23] A. Bifet, G. De Francisci Morales, J. Read, G. Holmes, and B. Pfahringer, “Efficient online evaluation of big data stream
classifiers,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug.
2015, pp. 59–68, doi: 10.1145/2783258.2783372.
[24] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection for discrete sequences: a survey,” IEEE Transactions on
Knowledge and Data Engineering, vol. 24, no. 5, pp. 823–839, May 2012, doi: 10.1109/TKDE.2010.235.
[25] IMDb, “Internet movie database,” IMDb datasets. https://datasets.imdbws.com/ (accessed Jul. 13, 2023).
[26] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011,
vol. 1, pp. 142–150.
[27] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: massive online analysis,” Journal of Machine Learning Research,
vol. 11, pp. 1601–1604, 2010.
[28] P. Fabian et al., “Scikit-learn: machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[29] B. H. Shekar and G. Dagnew, “Grid search-based hyperparameter tuning and classification of microarray cancer data,” in 2019
Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Feb. 2019, pp. 1–8,
doi: 10.1109/ICACCP.2019.8882943.
[30] S. Mannor, D. Peleg, and R. Rubinstein, “The cross entropy method for classification,” in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 561–568, doi: 10.1145/1102351.1102422.
BIOGRAPHIES OF AUTHORS
Nada Adel Dief is a computer engineer interested in big data, deep learning,
machine learning, and text mining. She received her master of science degree from Mansoura
University in 2016 and is currently working as a teaching assistant in the Department of Computer
Engineering and Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt. She can be
contacted at email: nadadief@mans.edu.eg.
Mofreh Mohamed Salem received his Ph.D. degree from Strathclyde University,
U.K., in 1985. He was the director of the Software Engineering Unit, Faculty of Engineering,
from 2001 to 2006. He was the head of the Computers Engineering and Control Department,
Faculty of Engineering, Mansoura University, Egypt, from 2004 to 2008, where he is
currently a member of the Computer Center Council. He was the Dean of the High Institute
for Computers in Mansoura, from 2008 to 2011. He has published 92 scientific articles in
international journals, periodicals, and conferences on computer engineering. His current
research interests include software engineering, computer systems design, parallel processing,
computer networks, cloud computing, and big data. Department of Computer Engineering and
Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt. He can be
contacted at email: dr_mofreh@mans.edu.eg.
Asmaa Hamdy Rabie received a B.Sc. in computers and systems engineering,
with a general grade of excellent with class honors in 2013. She got her master's degree in the
area of load forecasting using data mining techniques in 2016 at the Computers Engineering
and System Department, Mansoura University, Egypt. She got her Ph.D. degree in load
forecasting using data mining techniques in 2020 at Computers Engineering and System
Department, Mansoura University, Egypt. Her interests include programming languages,
classification, big data, data mining, healthcare systems, and the internet of things. She is
currently a lecturer in the Faculty of Engineering at Mansoura University, Egypt. She can be
contacted at email: asmaa91hamdy@yahoo.com.
Ali Ibrahim EL-Desouky received his M.Sc. and Ph.D. degrees from the
University of Glasgow, U.K. He is currently a full professor with the Computers Engineering
and Systems Department, Faculty of Engineering, Mansoura University, Egypt. He is also a
visiting part-time professor with MET Academy. He also teaches at American and Mansoura
universities and has taken over many positions of leadership and supervision of many
scientific articles. He has published hundreds of articles in well-known international journals.
Department of Computer Engineering and Systems, Faculty of Engineering, Mansoura
University, Mansoura, Egypt. Email: adesoky@mans.edu.eg.

More Related Content

PDF
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
PDF
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
PDF
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
PDF
IEEE Networking 2016 Title and Abstract
PDF
A data estimation for failing nodes using fuzzy logic with integrated microco...
PDF
Parallel and distributed system projects for java and dot net
PPT
PPT - Ph.D. Semester Progress Review 3.ppt
PDF
M010237578
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
IEEE Networking 2016 Title and Abstract
A data estimation for failing nodes using fuzzy logic with integrated microco...
Parallel and distributed system projects for java and dot net
PPT - Ph.D. Semester Progress Review 3.ppt
M010237578

Similar to Enhanced transformer long short-term memory framework for datastream prediction (20)

PDF
QoS_Aware_Replica_Control_Strategies_for_Distributed_Real_time_dbms.pdf
PDF
Hardback solution to accelerate multimedia computation through mgp in cmp
PDF
shashank_spdp1993_00395543
PDF
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
PDF
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
PDF
Ax34298305
PDF
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
PDF
ICICCE0298
PPTX
DEEP LEARNING (UNIT 2 ) by surbhi saroha
DOCX
Ns2 2015 2016 ieee project list-(v)_with abstract(S3 Infotech:9884848198)
PDF
A SURVEY OF NEURAL NETWORK HARDWARE ACCELERATORS IN MACHINE LEARNING
PDF
Data Distribution Handling on Cloud for Deployment of Big Data
PDF
Data Distribution Handling on Cloud for Deployment of Big Data
PDF
Intrusion Detection System using K-Means Clustering and SMOTE
PDF
Performance assessment of time series forecasting models for simple network m...
DOCX
Abnormal Traffic Detection Based on Attention and Big Step Convolution.docx
DOCX
Abnormal Traffic Detection Based on Attention and Big Step Convolution.docx
PDF
Towards a low cost etl system
PDF
An octa core processor with shared memory and message-passing
PDF
Distributeddatabasesforchallengednet
QoS_Aware_Replica_Control_Strategies_for_Distributed_Real_time_dbms.pdf
Hardback solution to accelerate multimedia computation through mgp in cmp
shashank_spdp1993_00395543
On The Performance of Intrusion Detection Systems with Hidden Multilayer Neur...
ON THE PERFORMANCE OF INTRUSION DETECTION SYSTEMS WITH HIDDEN MULTILAYER NEUR...
Ax34298305
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...
ICICCE0298
DEEP LEARNING (UNIT 2 ) by surbhi saroha
Ns2 2015 2016 ieee project list-(v)_with abstract(S3 Infotech:9884848198)
A SURVEY OF NEURAL NETWORK HARDWARE ACCELERATORS IN MACHINE LEARNING
Data Distribution Handling on Cloud for Deployment of Big Data
Data Distribution Handling on Cloud for Deployment of Big Data
Intrusion Detection System using K-Means Clustering and SMOTE
Performance assessment of time series forecasting models for simple network m...
Abnormal Traffic Detection Based on Attention and Big Step Convolution.docx
Abnormal Traffic Detection Based on Attention and Big Step Convolution.docx
Towards a low cost etl system
An octa core processor with shared memory and message-passing
Distributeddatabasesforchallengednet
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...
Ad

Recently uploaded (20)

PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Current and future trends in Computer Vision.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
Sustainable Sites - Green Building Construction
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPT
Mechanical Engineering MATERIALS Selection
PDF
PPT on Performance Review to get promotions
DOCX
573137875-Attendance-Management-System-original
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
composite construction of structures.pdf
PDF
Digital Logic Computer Design lecture notes
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Current and future trends in Computer Vision.pptx
OOP with Java - Java Introduction (Basics)
Sustainable Sites - Green Building Construction
Operating System & Kernel Study Guide-1 - converted.pdf
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Mechanical Engineering MATERIALS Selection
PPT on Performance Review to get promotions
573137875-Attendance-Management-System-original
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
composite construction of structures.pdf
Digital Logic Computer Design lecture notes
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf

Enhanced transformer long short-term memory framework for datastream prediction

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 14, No. 1, February 2024, pp. 830~840 ISSN: 2088-8708, DOI: 10.11591/ijece.v14i1.pp830-840  830 Journal homepage: http://ijece.iaescore.com Enhanced transformer long short-term memory framework for datastream prediction Nada Adel Dief, Mofreh Mohamed Salem, Asmaa Hamdy Rabie, Ali Ibrahim El-Desouky Department of Computer and Control Systems Engineering, Faculty of Engineering, Mansoura University, Cairo, Egypt Article Info ABSTRACT Article history: Received Aug 15, 2023 Revised Sep 26, 2023 Accepted Oct 9, 2023 In machine learning, datastream prediction is a challenging issue, particularly when dealing with enormous amounts of continuous data. The dynamic nature of data makes it difficult for traditional models to handle and sustain real-time prediction accuracy. This research uses a multi-processor long short-term memory (MPLSTM) architecture to present a unique framework for datastream regression. By employing several central processing units (CPUs) to divide the datastream into multiple parallel chunks, the MPLSTM framework illustrates the intrinsic parallelism of long short-term memory (LSTM) networks. The MPLSTM framework ensures accurate predictions by skillfully learning and adapting to changing data distributions. Extensive experimental assessments on real-world datasets have demonstrated the clear superiority of the MPLSTM architecture over previous methods. This study uses the transformer, the most recent deep learning breakthrough technology, to demonstrate how well it can handle challenging tasks and emphasizes its critical role as a cutting-edge approach to raising the bar for machine learning. Keywords: Datastream Long short-term memory Machine learning Multiprocessing pool Parallel processing Prediction accuracy Transformer This is an open access article under the CC BY-SA license. Corresponding Author: Nada Adel Dief Department of Computer and Control Systems Engineering, Faculty of Engineering, Mansoura University Mansoura, 35511, Egypt Email: nadadief@mans.edu.eg 1. INTRODUCTION In the era of big data, traditional machine learning methods can be computationally burdensome and complex, making them unsuitable for processing such large-scale datasets. To achieve accurate predictions for data, traditional machine learning techniques often struggle to handle the challenges posed by big data, including its sheer volume, complexity, and high-dimensional nature [1]. On the other hand, data-driven methods utilizing deep learning have attracted interest due to their capacity to perform statistical analysis and information extraction automatically and successfully on large-scale, multi-source, and high-dimensional data., thereby overcoming the limitations of traditional prediction methods [2]. Recurrent neural networks (RNNs) are a kind of neural network that is particularly good at handling sequential data. have a feedback connection, in contrast to conventional feedforward neural networks, which enables them to keep an internal memory of prior inputs. This memory enables RNNs to effectively capture temporal dependencies and patterns in sequential data. However, traditional RNNs experience the “vanishing gradient” problem, where the gradient signal weakens with time and makes it difficult to adequately capture long-term dependencies. To overcome this restriction, variants like long short-term memory (LSTM) and gated recurrent unit (GRU) were developed. 
Incorporating gating mechanisms that selectively remember or forget information, these models are better able to capture and spread pertinent information over longer sequences [3]. Additionally, in neural networks (NNs), the pre-assignment of
  • 2. Int J Elec & Comp Eng ISSN: 2088-8708  Enhanced transformer long short-term memory framework for datastream prediction (Nada Adel Dief) 831 parameters defines the network's topology and has an impact on how computationally intensive training and prediction are. Therefore, optimizing the parameters is crucial for achieving excellent performance. However, like other deep learning (DL) networks [4], LSTM also faces the challenge of parameter selection, which often requires hand-engineered adjustments. Manual parameter adjustment is difficult, particularly when dealing with vast amounts of data and very deep network structures. To address this issue, a grid search (GS) is employed to look for the ideal settings for multi-processor long short-term memory (MPLSTM), leading to predicting the datastream flow. This approach aims to build a suitable model structure and increase the MPLSTM's prediction accuracy. A multi-processor LSTM framework for real-time data stream processing is the main goal of this study. It conducts a comprehensive analysis of MPLSTM using a real-life dataset, offering valuable insights into monitoring the parallel approach. 2. THE PROPOSED DATASTREAM MULTIPROCESSING LSTM FRAMEWORK In this section, a framework for real datastream analysis is presented that harnesses the strengths of MPLSTM along with other techniques to attain superior accuracy in real-time data processing. Figure 1 demonstrates the framework’s overall architecture and gives a visual representation of the intricate details that underlie its operational procedures. The framework is made up of several parts, including an output layer, a hidden layer, and a layer for data input. Figure 1. The details of the proposed MPLSTM framework 2.1. Data acquisition and preprocessing The pre-processing stage is essential for getting the data ready for input into MPLSTM. This phase involves several steps, including data cleaning, normalization, splitting, and reshaping [5]. Firstly, data cleaning is performed to remove any missing or inconsistent data that can adversely affect the model's performance. Secondly, normalization, splitting, and reshaping the input data into a form that is compatible with the LSTM unit to scale the data and bring it within a specified range to avoid bias in the model's performance [6]. 2.2. The learning model After splitting the dataset into train and test, the training set is divided into chunks and processed in parallel using the multiprocessing pool. The pool manages a pool of worker processes, automatically assigning tasks to available workers and handling the communication between the main process and the worker processes. After that, the proposed MPLSTM framework is trained with various hyperparameters adjusted through multiple experiments until reaching a stable state, optimizing the weights with a grid search algorithm for best performance.
  • 3.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 830-840 832 2.2.1. Parallelization using multiprocessing pool Parallelizing LSTM-based models using multiprocessing [7] enables faster processing of input sequences, efficient resource utilization, scalability, and flexibility in model design and optimization. It can be particularly useful in scenarios where large sequences or computationally intensive models need to be processed within a reasonable time frame. To apply multiprocessing, there are some steps [8] that must be followed: i) split the data into smaller chunks that can be processed independently, ii) create a function that will be executed in parallel by multiple processes, iii) this function will take a data chunk as input and perform LSTM processing on that chunk, iv) inside the function, create an instance of the LSTM model and train or predict on the input chunk, v) set up a multiprocessing pool, vi) the pool manages a group of worker processes that will execute the parallel processing function, vii) specify the number of worker processes to utilize, typically based on the available hardware resources, viii) collect the results from the parallel processes, and ix) however, it is important to note that the level of parallelism achievable depends on factors such as the number of available CPU cores, memory capacity, and the size of the input sequence. Parallel execution can be increased potentially by having a higher number of CPU cores, allowing for the execution of more processes in parallel. Similarly, the presence of sufficient memory capacity is crucial to accommodate the running of data and processes in parallel without being constrained by memory-related issues. 2.2.2. Dropout layer To increase the network's speed and sturdiness, dropout regularization has been incorporated [9]. A regularization method is frequently employed in neural networks. It randomly deactivates a fraction of the neurons in the previous layer during each training iteration. This dropout of neurons helps prevent overfitting [10] by reducing the reliance of the network on specific neurons and encourages the learning of more robust and generalizable representations. During inference, the dropout layer is typically turned off, and the full network is used for making predictions. By incorporating dropout layers, the network becomes more resilient to overfitting and can improve its generalization performance on unseen data [11], [12]. 2.2.3. Dense layer A fully connected layer is a fundamental component of a neural network. It consists of multiple nodes, or neurons, where every neuron is linked to every other neuron in the layer below. An activation function is applied to the weighted sum of the inputs from the layer before to determine each neuron's output in a dense layer [13]. To maximize the network's performance on the specified task, these weights and biases are learned throughout the training phase [14]. The model consists of several dense layers that are fully connected to all the activations in the former layer. These dense layers combine the complicated feature maps to produce a feature vector that is flattened. The occurrences are then categorized using the softmax [15] output probabilities produced by the last dense layer. 2.2.4. Adam optimizer This section highlights the importance of parameter optimization in improving the model's performance. MPLSTM was trained using the Adam optimizer algorithm. 
The Adam optimizer is a well-established optimization technique frequently employed in deep learning [16]. It combines the advantages of the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp) by adapting the learning rate for each parameter individually. This adaptive learning rate helps achieve faster convergence as well as improved performance during training. The Adam optimizer is used to reduce the categorical cross-entropy loss function with a learning rate of 10^{-4}. By using the Adam optimizer to successfully update its weights and biases, MPLSTM can reduce the loss function and boost its overall accuracy.

2.3. Prediction
In the prediction phase, the input data is fed forward through the network, and the softmax function is applied in the output layer to create a probability distribution over the classes [17]. The predicted class label is normally determined by the class with the highest probability. Softmax ensures that the predicted probabilities are restricted between 0 and 1 and sum to 1, which makes it appropriate for multi-class classification problems in which each instance belongs to a single class.

P_i = \frac{\exp(Z_i)}{\sum_{j=1}^{n} \exp(Z_j)}   (1)

where n is the total number of classes and Z_i is the raw output value for class i. By applying softmax, the neural network can provide a probability-based prediction, allowing for decision-making based on the highest-probability class.
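As a small worked illustration of (1), the following sketch computes softmax probabilities for an assumed vector of raw outputs; the numbers are arbitrary examples.

```python
import numpy as np

def softmax(z):
    """Convert raw outputs Z into probabilities P_i as in equation (1)."""
    z = z - np.max(z)              # subtract the maximum for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Assumed raw outputs for a 3-class example; the probabilities sum to 1.
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                  # approximately [0.659 0.242 0.099] 1.0
```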
2.4. Evaluation
Datastreams often exhibit changes in the class distribution of incoming instances. The following metrics provide a comprehensive assessment of the model's performance, considering the evolving nature of the data stream and allowing for timely adaptation and monitoring. To evaluate the results, MPLSTM first uses classification accuracy, which compares the predictions of MPLSTM with the actual target values from the dataset [18]:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}   (2)

where the number of correct predictions is the count of instances for which MPLSTM correctly predicts the target value, and the total number of predictions is the number of instances for which MPLSTM made predictions. The result is a value between 0 and 1, representing the proportion of correct predictions made by MPLSTM. In addition, three error metrics, the mean square error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), are used to assess the model's performance. These measures evaluate several facets of the model's precision and predictive ability [19]:

MSE = \frac{1}{n} \sum_{t=1}^{n} (y_t - \bar{y}_t)^2   (3)

RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \bar{y}_t)^2}   (4)

MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - \bar{y}_t|   (5)

where n is the number of samples and y_t and \bar{y}_t are the predicted and actual values, respectively. RMSE measures the average magnitude of the prediction errors by taking the square root of the mean squared difference between the predicted values y_t and the actual values \bar{y}_t, indicating how accurately the model predicts the target variable, while MAE measures the average absolute difference between y_t and \bar{y}_t.

3. LSTM ENHANCEMENT
Inspired by the transformer model's innovations [20], we enhance LSTM by incorporating transformer principles. This fusion includes self-attention and cross-attention mechanisms [21] similar to those in transformers, improving LSTM's ability to capture complex data dependencies, especially in large datasets. The resulting TransLSTM architecture combines the strengths of the LSTM and the transformer, making it adaptable and powerful for real-world applications and predictions. Figure 2 illustrates TransLSTM: input encoding converts tokens to continuous vectors, positional encodings provide context and positional information, transformer encoder blocks process sequences with multi-head self-attention and feedforward networks, the LSTM integration captures sequential dependencies, an attention mechanism combines information from both sources, and the output layer produces the final predictions.

Figure 2. TransLSTM architecture
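As an illustration of how these components could be wired together, the following is a hedged Keras sketch of a TransLSTM-style model; the vocabulary size, embedding dimension, number of heads, sequence length, and class count are illustrative assumptions, and the cross-attention combination described above is simplified here to a single self-attention encoder block followed by an LSTM.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Input encoding: token embeddings plus learned positional embeddings."""
    def __init__(self, maxlen, vocab_size, d_model):
        super().__init__()
        self.tok_emb = layers.Embedding(vocab_size, d_model)
        self.pos_emb = layers.Embedding(maxlen, d_model)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.tok_emb(x) + self.pos_emb(positions)

def build_translstm(vocab_size=10000, maxlen=200, d_model=64,
                    num_heads=4, ff_dim=128, num_classes=2):
    """Sketch of a TransLSTM-style model; all sizes are illustrative assumptions."""
    inputs = keras.Input(shape=(maxlen,), dtype="int32")
    x = TokenAndPositionEmbedding(maxlen, vocab_size, d_model)(inputs)

    # Transformer encoder block: multi-head self-attention and a feedforward
    # network, each wrapped in a residual connection with layer normalization.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    x = layers.LayerNormalization()(x + ff)

    # LSTM integration: capture sequential dependencies in the encoded sequence.
    x = layers.LSTM(d_model)(x)
    x = layers.Dropout(0.2)(x)

    # Output layer producing the final class probabilities.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```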
4. RESULTS AND DISCUSSION
This section outlines the comparative study conducted to assess the performance of MPLSTM. The experimental procedure employs statistical analysis to evaluate the results obtained across all datasets, comparing MPLSTM with several state-of-the-art algorithms for data stream classification. The results of this study provide important new information about the performance of the proposed MPLSTM framework and its competitive position among data stream classification techniques.

4.1. Dataset
A total of 29 different time-series datasets were used in this study, all taken from the publicly accessible UCR repository [22]. These datasets have previously been used in research on applications such as stream clustering [23], anomaly detection [24], and data stream density estimation. Each dataset comprises instances of a one-dimensional time series with a built-in grid structure. The IMDB dataset [25], introduced by Maas et al. [26], is a prominent benchmark for sentiment classification. It comprises 25,000 reviews in each of the training and test sets, limited to 30 reviews per movie for diversity. This balanced dataset contains an equal number of positive and negative reviews, establishing a 50% accuracy baseline for random predictions.

4.2. Case study 1
The study encompassed a comprehensive exploration of various established techniques, aiming to cover all algorithm families proposed in the literature for the given problem. Table 1 provides an overview of the evaluated classifiers, organized by their respective families, and includes the abbreviations used throughout this paper [27], [28]. The results obtained from the conducted experiments are presented and discussed. Additionally, the processing time on each dataset is analyzed, considering the significance of speed-up in a data streaming scenario.

Table 1. Utilized models for case study 1
Classifier | Abbreviation | Family
Naive Bayes | NaivBy | Bayesian classifiers
Adaptive size Hoeffding tree | AdptSHOFT | Decision tree
Stochastic gradient descent | StoGrdD | Function classifiers
Single classifier drift | SnglCDrft | Drift classifiers
Leveraging bagging | LvrgBag | Meta classifiers (bagging)
Adaptive random forest | AdptRnF | Meta classifiers (bagging)
Boosting using ADWIN | BoAdwin | Meta classifiers (boosting)
Multi-layer perceptron | MLPrecept | Neural networks

Hyperparameter selection typically relies on rule-of-thumb parameters or proven combinations from previous studies. Here, a systematic approach, grid search (GS) [29], is employed for meticulous hyperparameter selection. Grid search is favored for its simplicity, parallelizability, and effectiveness in low-dimensional spaces. It entails discretizing the hyperparameter value ranges and systematically testing all possible combinations, thereby exploring diverse model configurations. Before training the final models, a validation run optimizes the hyperparameters based on accuracy assessments, and training ends when the maximum epoch limit is reached. The MPLSTM configuration details are summarized in Table 2. In this paper, the Adam optimizer is chosen post-validation for its computational efficiency and slightly superior test results. A batch size of 32 is used for all models, and sparse categorical cross-entropy [30] is employed as the loss function.
This loss function calculates the negative logarithm of the predicted probability assigned to the true class index, reflecting the model's confidence in its class prediction:

\text{Sparse categorical cross-entropy} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)   (6)

where n represents the number of classes, y_i represents the true label or target value of the i-th class, and \hat{y}_i represents the predicted probability for the corresponding class.

Table 2. Hyperparameters used in tuning the MPLSTM framework
Network parameter | Configuration
Dense | 10
Number of epochs | 200
Optimization function | Adam optimizer
Batch size | 32
Learning rate | 0.001
Activation function | Softmax
Loss function | Sparse categorical cross-entropy
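As a hedged illustration of how such a configuration could be tuned, the sketch below runs a simple grid search over a few hyperparameters of a Keras LSTM; the candidate values, model sizes, and validation split are illustrative assumptions rather than the grid actually used.

```python
import itertools
import numpy as np
from tensorflow import keras

def build_model(units, learning_rate, timesteps=30, n_features=1, n_classes=10):
    """Small LSTM classifier reflecting the Table 2 style of configuration."""
    model = keras.Sequential([
        keras.layers.LSTM(units, input_shape=(timesteps, n_features)),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

def grid_search(X_train, y_train, X_val, y_val):
    """Discretize the hyperparameter ranges and test every combination."""
    grid = {"units": [32, 64, 128], "learning_rate": [1e-3, 1e-4]}   # assumed grid
    best = (None, -np.inf)
    for units, lr in itertools.product(grid["units"], grid["learning_rate"]):
        model = build_model(units, lr)
        model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
        _, acc = model.evaluate(X_val, y_val, verbose=0)
        if acc > best[1]:
            best = ({"units": units, "learning_rate": lr}, acc)
    return best   # best hyperparameters and their validation accuracy
```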
4.3. Performance evaluation
Batch sizes above 30 exhibit stable accuracy and processing times regardless of the number of batches, offering flexibility in parameter selection. However, batch sizes below 30 significantly degrade performance and hinder the model's adaptability to evolving data streams, since very small batches focus too heavily on individual examples and prevent the model from learning overall changes in the data distribution. MPLSTM achieves high accuracy across various datasets, showcasing LSTM's suitability for time-series data streaming. The convergence of the training and validation loss curves during model training is a positive indicator, signifying learning progress. MPLSTM reduces processing time significantly through parallel processing, enhancing accuracy and predictive capability; the trade-off against the additional computational resources required should be considered based on application requirements. Figure 3 shows that parallel processing consistently outperforms sequential processing across the 29 datasets, confirming MPLSTM's effectiveness.

Figure 3. A comparison between sequential and parallel execution processing times

The FacesUCR dataset exhibits a speedup of 2 times when processed in parallel, indicating a significant improvement in processing time compared to sequential processing. Similarly, for the Pendigits dataset, the parallel model achieves a speedup of 1.7, further demonstrating the efficiency of parallel processing over the sequential approach. Several other datasets, such as PhalangesOutlinesCorrect and TwoPatterns, also exhibit notable speedups larger than 1.5 times when processed in parallel. These findings further emphasize the effectiveness of MPLSTM in reducing processing time. The speedups observed across multiple datasets, shown in Figure 4, underscore the model's ability to leverage parallelism efficiently, resulting in faster dataset processing. By harnessing parallel processing, MPLSTM demonstrates its capability to significantly improve performance and expedite data analysis tasks. Upon closer analysis, the processing time of individual datasets such as ECG5000 shows notable improvement, as seen in Figure 5, although the magnitude of the speedup is not extremely high. However, for the Pendigits dataset, with its larger size and increased complexity, the benefits of parallel processing become increasingly pronounced, and the speedup achieved becomes substantially larger.
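A minimal sketch of how such speedups could be measured follows; the chunk-training routines are passed in as parameters (for example, the hypothetical functions sketched in section 2.2), and the speedup is the ratio of sequential to parallel wall-clock time.

```python
import time

def measure_speedup(chunks, train_one, train_parallel):
    """Compare sequential and pool-based training times and return the speedup.
    train_one and train_parallel are the chunk-training routines sketched earlier."""
    start = time.perf_counter()
    for chunk in chunks:                      # sequential baseline: one chunk at a time
        train_one(chunk)
    sequential_time = time.perf_counter() - start

    start = time.perf_counter()
    train_parallel(chunks)                    # pool-based execution of the same chunks
    parallel_time = time.perf_counter() - start

    return sequential_time / parallel_time    # e.g. about 2.0 for FacesUCR in the paper
```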
Figure 4. Dataset speedup curve when applying the MPLSTM framework

Figure 5. A comparison of the processing time of sequential and parallel execution for the ECG5000 and Pendigits datasets

Figure 6 displays the learning curves, which depict the accuracy improvement across each epoch for the ECG5000 and Pendigits datasets. These curves visually demonstrate how the model's accuracy increases as training progresses and show how well the model learned from the data. The consistent improvement in accuracy over the course of the epochs implies that the model is not merely memorizing the training data but also generalizing well, which is encouraging for its ability to predict outcomes on fresh data. Similarly, Figure 7 illustrates the decrease in loss across each epoch for the same datasets. These learning curves provide crucial insights into the model's performance and offer valuable guidance for optimizing its architecture and training procedure. By addressing overfitting and considering early stopping, the model's accuracy can be further improved while maintaining good generalization capabilities.
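For example, overfitting of the kind visible in such curves is often mitigated with an early-stopping callback; the following hedged Keras sketch shows one possible configuration, where the monitored quantity and patience value are assumptions rather than the settings used here.

```python
from tensorflow import keras

# Stop training once the validation loss stops improving and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

# Hypothetical usage with an already-built model and prepared data:
# history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                     epochs=200, batch_size=32, callbacks=[early_stop])
```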
In Table 3, the proposed MPLSTM framework's effectiveness was assessed using MSE, RMSE, and MAE as evaluation metrics; low values of these metrics indicate better performance. The framework yielded promising results with low values, for example, MSE = 0.237 for ECG5000, RMSE = 0.583 for PhalangesOutlinesCorrect, and MAE = 0.074 for Pendigits. This implies that the predictions made by the MPLSTM model were close to the actual values. The model performed well because it was able to learn the patterns and relationships in the data, which led to accurate predictions and shows that the MPLSTM framework is an effective way to address this problem. Table 4 showcases the performance evaluation results for MPLSTM on the UCR datasets in terms of accuracy. It provides a comprehensive overview of the accuracy achieved by the framework in predicting the target variable, with a detailed breakdown of the accuracy scores across the compared classifiers, allowing researchers and practitioners to assess the effectiveness and reliability of MPLSTM in accurately predicting the target variable on the UCR datasets.

Figure 6. The accuracy curves of the training and validation sets for the ECG5000 and Pendigits datasets

Figure 7. The loss curves of the training and validation sets for the ECG5000 and Pendigits datasets

Table 3. Performance of MPLSTM in terms of MSE, RMSE, and MAE
Dataset | MSE | RMSE | MAE
Wafer | 0.449 | 0.670 | 0.2247
Pendigits | 0.390 | 0.625 | 0.074
ECG5000 | 0.237 | 0.478 | 0.113
HandOutlines | 0.361 | 0.601 | 0.361
PhalangesOutlinesCorrect | 0.340 | 0.583 | 0.340
ChlorineConcentration | 1.11 | 1.0536 | 0.340
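For reference, the error metrics reported in Table 3 correspond to the standard computations sketched below; the arrays in the usage comment are hypothetical placeholders, and the scikit-learn equivalents are noted in comments.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Compute MSE, RMSE, and MAE as in equations (3)-(5)."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    mse = np.mean(err ** 2)          # sklearn.metrics.mean_squared_error
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))       # sklearn.metrics.mean_absolute_error
    return mse, rmse, mae

# Hypothetical usage with model outputs and ground-truth values:
# mse, rmse, mae = regression_errors(y_test, model.predict(X_test).ravel())
```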
Table 4. Accuracy of the top 7 classifiers for the UCR datasets compared with the MPLSTM framework
Dataset | Proposed MPLSTM | AdptRnF | MLPrecept | NaivBy | SnglCDrft | AdptSHOFT | LvrgBag | BoAdwin | StoGrdD
Wafer | 0.100 | 0.982 | 0.991 | 0.194 | 0.192 | 0.356 | 0.963 | 0.960 | 0.542
Pendigits | 0.976 | 0.950 | 0.938 | 0.824 | 0.784 | 0.850 | 0.867 | 0.909 | 0.800
ECG5000 | 0.941 | 0.856 | 0.877 | 0.750 | 0.772 | 0.750 | 0.752 | 0.843 | 0.833
ElectricDevices | 0.85 | 0.526 | 0.526 | 0.456 | 0.456 | 0.456 | 0.468 | 0.457 | 0.194
HandOutlines | 0.638 | 0.720 | 0.634 | 0.533 | 0.533 | 0.533 | 0.530 | -0.084 | 0.475
PhalangesOutlinesCorrect | 0.659 | 0.377 | 0.060 | 0.134 | 0.134 | 0.133 | 0.245 | 0.277 | 0.072
ChlorineConcentration | 0.533 | 0.149 | 0.082 | 0.122 | 0.122 | 0.001 | 0.063 | 0.001 | 0.099

4.5. Case study 2
In this study, LSTM is integrated with the Transformer model to create TransLSTM, a novel architecture. TransLSTM leverages the Transformer's success in handling sequential data and capturing long-range dependencies. This fusion enhances LSTM's ability to model complex relationships and temporal dependencies in sequential data by incorporating self-attention and cross-attention mechanisms from the Transformer. The investigation demonstrates how TransLSTM can address LSTM's limitations, potentially leading to more accurate and efficient predictions. This case study highlights the innovative potential of combining diverse neural architectures for enhanced predictive capabilities.

4.6. TransLSTM evaluation
The training history curves in this case study offer insights into the performance of the two models, LSTM and TransLSTM, across multiple epochs. In Figure 8, the first and third curves represent training loss for LSTM and TransLSTM, while the second and fourth curves represent validation loss for LSTM and TransLSTM, respectively. These curves depict the evolution of training loss over the epochs, showing a decreasing trend and indicating learning from the training data. TransLSTM consistently achieves lower training loss and outperforms LSTM in validation loss, indicating better generalization to new data. Similarly, the accuracy curves in the same figure show training and validation accuracy, with the first and third curves representing training accuracy for LSTM and TransLSTM, and the second and fourth curves representing validation accuracy. Both models exhibit an increasing trend in training accuracy, demonstrating efficient learning from the training data as well as the capacity to generalize to new, unseen data. TransLSTM achieves higher training and validation accuracy, highlighting its superior data modeling capabilities.

Figure 8. Comparison between LSTM and TransLSTM training and validation loss and accuracy

5. CONCLUSION
In conclusion, this paper presents a novel framework for datastream regression, referred to as MPLSTM. The proposed framework effectively addresses the challenges associated with handling continuous and large-scale data in real-time prediction scenarios. By leveraging the inherent parallelism of LSTM networks, MPLSTM achieves a remarkable balance between high prediction accuracy and
computational efficiency. Experimental evaluations, conducted on real-world datasets including the UCR datasets, validate the superior performance of MPLSTM compared to traditional regression models. The framework's ability to capture temporal dependencies and long-term patterns in streaming data is demonstrated through accurate predictions, as evidenced by accuracy measures and loss calculations. MPLSTM emerges as a promising approach for datastream prediction, showcasing improved performance and outperforming existing results in terms of accuracy and loss.

REFERENCES
[1] S. Bharany et al., "A comprehensive review on big data challenges," in 2023 International Conference on Business Analytics for Technology and Security (ICBATS), Mar. 2023, pp. 1–7, doi: 10.1109/ICBATS57792.2023.10111375.
[2] S. Homayoun and M. Ahmadzadeh, "A review on data stream classification approaches," Journal of Advanced Computer Science and Technology, vol. 5, no. 1, Feb. 2016, doi: 10.14419/jacst.v5i1.5225.
[3] S. Ray, "A quick review of machine learning algorithms," in 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Feb. 2019, pp. 35–39, doi: 10.1109/COMITCon.2019.8862451.
[4] F. Karim, S. Majumdar, H. Darabi, and S. Chen, "LSTM fully convolutional networks for time series classification," IEEE Access, vol. 6, pp. 1662–1669, 2018, doi: 10.1109/ACCESS.2017.2779939.
[5] S. Smyl and K. Kuber, "Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks," in 36th International Symposium on Forecasting, 2016.
[6] I. O. Muraina, "Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts," in 7th International Mardin Artuklu Scientific Researches Conference, 2022, pp. 496–504.
[7] J. Hunt, "Multiprocessing," in Advanced Guide to Python 3 Programming, Springer International Publishing, 2019, pp. 363–376.
[8] Z. A. Aziz, D. Naseradeen Abdulqader, A. B. Sallow, and H. Khalid Omer, "Python parallel processing and multiprocessing: a rivew," Academic Journal of Nawroz University, vol. 10, no. 3, pp. 345–354, Aug. 2021, doi: 10.25007/ajnu.v10n3a1145.
[9] X. Liang et al., "R-Drop: regularized dropout for neural networks," Advances in Neural Information Processing Systems, vol. 13, pp. 10890–10905, 2021.
[10] X. Ying, "An overview of overfitting and its solutions," Journal of Physics: Conference Series, vol. 1168, no. 2, Feb. 2019, doi: 10.1088/1742-6596/1168/2/022022.
[11] N. Watt and M. C. du Plessis, "Dropout for recurrent neural networks," in Proceedings of the International Neural Networks Society, Springer International Publishing, 2020, pp. 38–47.
[12] A. Zunino, S. A. Bargal, P. Morerio, J. Zhang, S. Sclaroff, and V. Murino, "Excitation dropout: encouraging plasticity in deep neural networks," International Journal of Computer Vision, vol. 129, no. 4, pp. 1139–1152, Jan. 2021, doi: 10.1007/s11263-020-01422-y.
[13] D. Jha, A. Yazidi, M. A. Riegler, D. Johansen, H. D. Johansen, and P. Halvorsen, "LightLayers: parameter efficient dense and convolutional layers for image classification," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12606, Springer International Publishing, 2021, pp. 285–296.
[14] P. Lara-Benítez, M. Carranza-García, D. Gutiérrez-Avilés, and J. C. Riquelme, "Data streams classification using deep learning under different speeds and drifts," Logic Journal of the IGPL, vol. 31, no. 4, pp. 688–700, Jul. 2023, doi: 10.1093/jigpal/jzac033.
[15] O. Du, Y. Zhang, X. Li, J. Zhu, T. Zheng, and Y. Li, "Multi-view heterogeneous network embedding," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13369 LNAI, Springer International Publishing, 2022, pp. 3–15.
[16] Z. Zhang, "Improved Adam optimizer for deep neural networks," in 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Jun. 2018, pp. 1–2, doi: 10.1109/IWQoS.2018.8624183.
[17] Y. Gao, W. Liu, and F. Lombardi, "Design and implementation of an approximate softmax layer for deep neural networks," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 2020, pp. 1–5, doi: 10.1109/ISCAS45731.2020.9180870.
[18] J. Gama, R. Sebastião, and P. P. Rodrigues, "On evaluating stream learning algorithms," Machine Learning, vol. 90, no. 3, pp. 317–346, Oct. 2013, doi: 10.1007/s10994-012-5320-9.
[19] T. Chai and R. R. Draxler, "Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature," Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.
[20] A. Vaswani et al., "Attention is all you need," arXiv:1706.03762, Jun. 2017.
[21] Z. Huang, P. Xu, D. Liang, A. Mishra, and B. Xiang, "TRANS-BLSTM: transformer with bidirectional LSTM for language understanding," arXiv:2003.07000, Mar. 2020.
[22] Y. Chen et al., "The UCR time series classification archive," NSF, Jul. 2015. https://www.cs.ucr.edu/~eamonn/time_series_data/ (accessed Jul. 13, 2023).
[23] A. Bifet, G. De Francisci Morales, J. Read, G. Holmes, and B. Pfahringer, "Efficient online evaluation of big data stream classifiers," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2015, pp. 59–68, doi: 10.1145/2783258.2783372.
[24] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection for discrete sequences: a survey," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, pp. 823–839, May 2012, doi: 10.1109/TKDE.2010.235.
[25] IMDb, "Internet movie database," IMDb datasets. https://datasets.imdbws.com/ (accessed Jul. 13, 2023).
[26] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, vol. 1, pp. 142–150.
[27] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: massive online analysis," Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.
[28] F. Pedregosa et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[29] B. H. Shekar and G. Dagnew, "Grid search-based hyperparameter tuning and classification of microarray cancer data," in 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Feb. 2019, pp. 1–8, doi: 10.1109/ICACCP.2019.8882943.
[30] S. Mannor, D. Peleg, and R. Rubinstein, "The cross entropy method for classification," in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 561–568, doi: 10.1145/1102351.1102422.
BIOGRAPHIES OF AUTHORS

Nada Adel Dief is a computer engineer interested in big data, deep learning, machine learning, and text mining. She received her master of science degree from Mansoura University in 2016 and is currently working as a teaching assistant in the Department of Computer Engineering and Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt. She can be contacted at email: nadadief@mans.edu.eg.

Mofreh Mohamed Salem received his Ph.D. degree from Strathclyde University, U.K., in 1985. He was the director of the Software Engineering Unit, Faculty of Engineering, from 2001 to 2006. He was the head of the Computers Engineering and Control Department, Faculty of Engineering, Mansoura University, Egypt, from 2004 to 2008, where he is currently a member of the Computer Center Council. He was the Dean of the High Institute for Computers in Mansoura from 2008 to 2011. He has published 92 scientific articles in international journals, periodicals, and conferences on computer engineering. His current research interests include software engineering, computer systems design, parallel processing, computer networks, cloud computing, and big data. He is with the Department of Computer Engineering and Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt, and can be contacted at email: dr_mofreh@mans.edu.eg.

Asmaa Hamdy Rabie received a B.Sc. in computers and systems engineering with a general grade of excellent with class honors in 2013. She received her master's degree in the area of load forecasting using data mining techniques in 2016 from the Computers Engineering and Systems Department, Mansoura University, Egypt, and her Ph.D. degree in load forecasting using data mining techniques in 2020 from the same department. Her interests include programming languages, classification, big data, data mining, healthcare systems, and the internet of things. She is currently a lecturer at the Faculty of Engineering, Mansoura University, Egypt. She can be contacted at email: asmaa91hamdy@yahoo.com.

Ali Ibrahim El-Desouky received his M.Sc. and Ph.D. degrees from the University of Glasgow, USA. He is currently a full professor with the Computers Engineering and Systems Department, Faculty of Engineering, Mansoura University, Egypt. He is also a visiting part-time professor with MET Academy. He also teaches at American and Mansoura universities and has held many positions of leadership, supervising many scientific works. He has published hundreds of articles in well-known international journals. He is with the Department of Computer Engineering and Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt, and can be contacted at email: adesoky@mans.edu.eg.