Bulletin of Electrical Engineering and Informatics
Vol. 10, No. 6, December 2021, pp. 3137~3146
ISSN: 2302-9285, DOI: 10.11591/eei.v10i6.2802
Journal homepage: https://beei.org
Development of 3D convolutional neural network to recognize
human activities using moderate computation machine
Malik A. Alsaedi, Abdulrahman S. Mohialdeen, Baraa M. Albaker
College of Engineering, Al-Iraqia University, Sabe’ Abkar, Adhamiya, Baghdad, Iraq
Article Info ABSTRACT
Article history:
Received Jan 12, 2021
Revised May 20, 2021
Accepted Oct 9, 2021
Human activity recognition (HAR) is used in numerous applications, including smart homes, to monitor human behavior, automate homes according to human activities, entertainment, fall detection, violence detection, and people care. Vision-based recognition is the most powerful method widely used in implementing HAR systems, owing to its ability to recognize complex human activities. This paper addresses the design of a 3D convolutional neural network (3D-CNN) model that can be used in smart homes to identify a number of activities. The model is trained on the KTH dataset, which contains activities such as walking, running, jogging, handwaving, handclapping, and boxing. Despite the challenges of this method arising from the effects of illumination, background variation, and human body variety, the proposed model reached an accuracy of 93.33%. The model was implemented, trained, and tested on a moderate computation machine, and the results show that the proposal successfully recognizes human activities with reasonable computation.
Keywords:
3D-CNN
Convolutional neural network
Deep learning
HAR
Smart home
Vision
This is an open access article under the CC BY-SA license.
Corresponding Author:
Abdulrahman S. Mohialdeen
College of Engineering
Al-Iraqia University
Sabe’ Abkar, Adhamiya, Baghdad, Iraq
Email: abd_saeed@aliraqia.edu.iq
1. INTRODUCTION
HAR is a challenging subject because of the huge number of human activities: some activities can be easily noticed, some are confusing, and some require interaction with other objects or humans. Besides the diversity of the activities, the recognition methods are also diverse. Many types of data can be used to recognize a human activity: some methods use ambient sensors such as accelerometers, gyroscopes, and humidity and temperature sensors [1], [2]; some exploit the smartphone's sensors, such as the accelerometer and gyroscope [3], [4]; others use radio frequency [5]. But the most popular recognition methods are vision-based [6]-[14].
Vision-based methods use images or videos to recognize the activity. There are many challenges in recognizing human activities from visual data, such as the effects of illumination and background variation. Still, the question is how to process these visual data to recognize the activity. Many techniques exist, most of them based on machine learning; deep learning has shown excellent results for recognizing human activities, especially CNNs, which are very useful for vision-based recognition.
In this paper, we design a neural network architecture (model) that can be used for human activity monitoring. The proposed model is built from three-dimensional CNNs (3D-CNNs); the purpose of using a 3D-CNN is to extract spatial and temporal features rather than spatial features only, since an activity consists
of multiple movements that can be identified by extracting temporal features across several frames.
Our proposed model uses a small number of 3D-CNN layers to reduce the processing time, so that even a computer with low computational ability can recognize human activity online without delay. The CNN was introduced by Fukushima [12]. At the beginning, it was mainly used for image classification and image feature extraction, and it produced some of the most attractive models invented for image classification challenges [13], [14]. During that time, many video human activity datasets were published, and KTH [15] is one of the most popular small datasets. The CNN architecture encouraged researchers to use it for HAR with image data taken from video datasets [10], [16]. Other researchers proposed two-stream CNNs, where the first stream takes image data and the second takes optical flow [16], [17]. Among the most cited 3D-CNN articles for HAR, [18] proposed a model evaluated on the TRECVID dataset, and [19] proposed a 3D-CNN model for the UCF-101 dataset [20]; our model is inspired by theirs.
2. RESEARCH METHOD
This paper proposes to use deep neural network (DNN) models that consist of many different layers, where each layer has its purpose during training and testing. The dominant layer this article focuses on is the convolutional layer, one of the most widely used neural network building blocks; its central idea is to apply filters to (convolve over) the input data and pass the convolved data to the next layer. CNNs were primarily used for image data; now there are 1D, 2D, and 3D CNNs that extend this architecture to other types of data with different numbers of dimensions. A 3D-CNN operates on three-dimensional data, which suits our project well because we are dealing with video data. The reason for using video rather than image data is that an activity is made of several consecutive movements of body parts; this continuous movement can be noticed across successive images, which is what a video is.
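To make the dimensionality concrete, here is a tiny shape check of a 3D convolution over a video clip (a sketch; the clip size follows the input used later in this paper):

```python
# A 3D convolution slides a (3, 3, 3) kernel over frames, height, and width,
# so it mixes temporal and spatial information in a single operation.
import numpy as np
import tensorflow as tf

clip = np.random.rand(1, 30, 40, 40, 1).astype('float32')  # (batch, frames, h, w, channels)
conv = tf.keras.layers.Conv3D(64, (3, 3, 3), padding='same', activation='relu')
print(conv(clip).shape)  # (1, 30, 40, 40, 64): temporal and spatial extents preserved
```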
The pooling layers used in this paper are max-pooling, which returns the maximum value within a kernel window as it slides over the data; average-pooling, which returns the average value within a kernel window as it slides over the data; and global-average-pooling, which returns the average value of each channel (kernel) of the CNN layer, which is why it is useful at the final part of the model to reduce the number of parameters. To help the model overcome overfitting, dropout layers are used.
2.1. Proposed models
The first suggested model, shown in Figure 1, is influenced by the model proposed by Tran et al. [19]. The model consists of three connected 3D convolutional layers with a 3*3*3 kernel size for all convolutional layers, each followed by a max-pooling layer with a kernel size of 2*2*2. Zero-padding is used for all convolutional layers, and a 'ReLU' activation function is added after each convolutional and fully connected (FC) layer, except for the last, which uses SoftMax.
Figure 1. Initial proposed model: Input (30*40*40*1) → Conv3D(64) → MaxPool → Conv3D(128) → MaxPool → Conv3D(256) → MaxPool → Flatten → FC(128) → FC(32) → FC(6) → SoftMax output (6 classes)
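For concreteness, here is a minimal Keras sketch of this initial model, with layer sizes taken from Figure 1 and the layer table in Figure 3 (the code and function name are ours, not the authors' released implementation):

```python
from tensorflow.keras import layers, models

def build_initial_model(num_classes=6):
    # 3*3*3 kernels with zero-padding ('same'), each followed by 2*2*2 max-pooling
    return models.Sequential([
        layers.Conv3D(64, (3, 3, 3), padding='same', activation='relu',
                      input_shape=(30, 40, 40, 1)),  # 30 grayscale 40x40 frames
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(128, (3, 3, 3), padding='same', activation='relu'),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(256, (3, 3, 3), padding='same', activation='relu'),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])

model = build_initial_model()
model.summary()  # 3,570,150 trainable parameters, matching the Figure 3 table
```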
In the second attempt, another convolutional and max-pooling layer was added before the flatten layer to increase the accuracy. Dropouts were then added at different places in the model, over five attempts, to obtain high accuracy; all dropouts used a rate of 0.5. To reduce the number of parameters (weights and biases), we returned to the first model with dropouts added. The accuracy increased with the dropouts, but the number of parameters became huge, because without the extra convolutional and pooling layers the output of the layer before the flatten layer is much larger, so the parameter count of the following fully connected layer grows. Equations (1)-(3) show how the number of parameters is calculated.
So, to solve this problem, a GlobalAveragePooling3D layer replaced the flatten layer. This replacement reduced the number of parameters by about two million, which is very helpful for machines with low or moderate computation capabilities performing online activity recognition.
No. of parameters in a CNN layer = (filter_width * filter_height * filter_depth * channels of previous layer + 1) * no. of filters (1)

No. of parameters in a FC layer = (neurons of previous layer + 1) * neurons of current layer (2)

No. of flatten output nodes = product of all previous layer dimensions (3)
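These formulas can be checked quickly against the layer tables reported later (a sketch; the helper names are ours):

```python
def conv3d_params(k_d, k_h, k_w, channels_prev, filters):
    return (k_d * k_h * k_w * channels_prev + 1) * filters  # (1), +1 for the bias

def fc_params(n_prev, n_curr):
    return (n_prev + 1) * n_curr                            # (2), +1 for the bias

print(conv3d_params(3, 3, 3, 1, 64))    # 1792    (first Conv3D layer)
print(conv3d_params(3, 3, 3, 64, 128))  # 221312  (second Conv3D layer)
print(fc_params(19200, 128))            # 2457728 (Dense 128 after Flatten)
```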
3. RESULTS AND DISCUSSION
The neural network model is trained on the KTH dataset. The dataset is doubled before use by adding a flipped copy of it; 20% of the data is held out for testing after training, another 20% is used for validation during training, and the remaining 60% is used for training. The model was trained using TensorFlow v2.1 [21] as the backend and Keras v2.3.1 [22] on Python v3.7.5 as the front-end, with a batch size of 16. Each frame taken from the video was 40x40 pixels with one channel (grayscale); 30 frames were taken from each video, with four frames discarded between each pair of taken frames.
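A sketch of this frame sampling, assuming OpenCV is available (the helper and the file name are illustrative):

```python
import cv2
import numpy as np

def sample_clip(path, n_frames=30, step=5, size=(40, 40)):
    # keep one frame in five (four discarded in between), grayscale, 40x40
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(cv2.resize(gray, size))
        i += 1
    cap.release()
    clip = np.asarray(frames, dtype=np.float32) / 255.0
    return clip[..., np.newaxis]  # shape (30, 40, 40, 1)

clip = sample_clip('person01_walking_d1.avi')  # a KTH clip; file name illustrative
flipped = clip[:, :, ::-1, :]                  # the flipped copy that doubles the dataset
```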
The optimizer of the model was the Adam optimizer [23], with a learning rate of 0.001 and the categorical cross-entropy loss function. The machine specification was: HP 15 Notebook, 12288 MB RAM, Intel Core i5-3230M processor with four logical cores (two physical) and a maximum frequency of 2.6 GHz, running Windows 8.1 Enterprise 64-bit (6.3, build 9600).
Validation data controls the training operation: after each epoch the model is evaluated on the validation data, and if the validation loss did not improve for three epochs, the learning rate was multiplied by a half, down to a minimum of 0.0001. If the validation loss did not improve for 15 consecutive epochs, training was stopped before reaching the given maximum of 100 epochs.
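Under these settings, the training setup can be sketched with standard Keras callbacks (x_train, y_train, x_val, and y_val stand for the preprocessed clips and one-hot labels; the exact callback configuration is our reading of the description above):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # halve the learning rate when validation loss stalls for 3 epochs, floor 0.0001
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=0.0001),
    # stop training if validation loss does not improve for 15 consecutive epochs
    EarlyStopping(monitor='val_loss', patience=15),
]

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=16, epochs=100, callbacks=callbacks)
```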
3.1. Calculating results
After the training operation finishes, the test samples are fed to the model to obtain its responses; the accuracy, precision, recall, and F1_score are calculated using (4)-(7), which use the confusion matrix annotation shown in Figure 2 [24], and the average loss is calculated using the categorical cross-entropy formula [25].
Figure 2. Confusion matrix annotation
Accuracy = Σ_{i=1}^{l} (tp_i + tn_i) / Σ_{i=1}^{l} (tp_i + fn_i + fp_i + tn_i) (4)
Equation (4) shows how accuracy is calculated; it measures the overall effectiveness of the model.
Precision = Σ_{i=1}^{l} tp_i / Σ_{i=1}^{l} (tp_i + fp_i) (5)
Equation (5) shows how precision is calculated; it measures the agreement between the true class labels and the predicted labels. Equation (6) shows how recall is calculated; it measures the effectiveness of the model at identifying the class labels. Equation (7) shows how the F1_score is calculated; it relates the model's outputs on the test data to the positive labels.
Equation (8) shows how the average loss is calculated, where N is the number of samples, M is the number of classes, d is the true (desired) label, and y is the output calculated by the model. Table 1 shows the calculated results for each model together with its figure number. Table 2 compares the accuracies of several studies on the KTH dataset for human activity recognition with that of our study; our method shows a remarkable improvement in accuracy.
Recall = Σ_{i=1}^{l} tp_i / Σ_{i=1}^{l} (tp_i + fn_i) (6)

F1_score = ((β^2 + 1) * Precision * Recall) / (β^2 * Precision + Recall) (7)

Loss = -(1/N) * Σ_{i=1}^{N} Σ_{j=1}^{M} d_{i,j} * log(y_{i,j}) (8)
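A sketch of computing (4)-(8) from the confusion matrix using scikit-learn (y_true, y_pred, d, and y are placeholders for the test labels, predicted labels, one-hot labels, and softmax outputs):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)  # l x l count matrix, l = 6 classes
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)

accuracy = (tp + tn).sum() / (tp + tn + fp + fn).sum()    # (4)
precision = tp.sum() / (tp + fp).sum()                    # (5)
recall = tp.sum() / (tp + fn).sum()                       # (6)
f1_score = 2 * precision * recall / (precision + recall)  # (7) with beta = 1
loss = -(d * np.log(y)).sum(axis=1).mean()                # (8)
```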
Table 1. Calculated results for all models
No. No. of figure Accuracy % Loss Precision % Recall % F1_score %
1. Figure 3 85.83 0.39 86.30 85.89 86.09
2. Figure 4 88.75 0.39 88.97 88.97 88.97
3. Figure 5 92.08 0.23 92.67 91.90 92.28
4. Figure 6 92.92 0.22 93.08 92.95 93.02
Table 2. Comparison of accuracies of studies done on the KTH dataset
No. Method Accuracy %
1 Ahmad and Lee [26] 84.83
2 Taylor et al. [27] 88.00
3 Qian et al. [28] 88.69
4 Our method 93.33
3.2. Calculating the number of operations for layers
We now calculate an approximate number of operations for each layer; the calculations do not include control operations and are detailed below:
a. 3D-CNN: each kernel convolves over the entire input data. For a 3D-CNN with a kernel size of (K_d, K_h, K_w), input data of (frames, height, width), and strides of (stride_d, stride_h, stride_w), we have:

No. of operations per node = (K_d * K_h * K_w + 1)^2 * no. of previous kernels (9)

Output nodes = ((frames - K_d)/stride_d + 1) * ((height - K_h)/stride_h + 1) * ((width - K_w)/stride_w + 1) * no. of current kernels (10)

Output nodes = (frames/stride_d) * (height/stride_h) * (width/stride_w) * no. of current kernels (11)
Equation (9) gives the number of operations for each output node produced by the convolution; the square accounts for the multiplications involved. Equation (10) [29] gives the number of output nodes of a 3D-CNN layer without zero-padding, and (11) [29] gives it when zero-padding is used, as in our proposed model; the total cost of a layer is the operations per node multiplied by the number of output nodes.
b. MaxPooling3D performs comparison operations. For a pool window size of (P_d, P_h, P_w), input data of (frames, height, width), and strides of (stride_d, stride_h, stride_w), we have:

No. of operations per node = P_d * P_h * P_w (12)

Output nodes = ((frames - P_d)/stride_d + 1) * ((height - P_h)/stride_h + 1) * ((width - P_w)/stride_w + 1) * no. of current kernels (13)

Output nodes = (frames/stride_d) * (height/stride_h) * (width/stride_w) * no. of current kernels (14)
Equation (12) gives the number of operations per pool window. Equation (13) gives the number of output nodes without padding, which is the case in our model, where the stride equals the pool window size; (14) gives the number of output nodes when zero-padding is used.
c. Fully connected layers have a vast number of parameters compared with CNN layers. If the number of neurons of the previous layer is N_previous and the number of neurons of the current layer is N_current, then (15) gives the number of operations for a fully connected layer; the square again accounts for the multiplications.
Number of operations = (N_previous * N_current)^2 (15)
d. Flatten requires only one operation, which is reshaping the dimensions into one dimension.
e. Dropout works only during training; it simply hides a randomly chosen portion of nodes so that they do not participate in producing the output at that point in training, so it costs no operations at test time.
f. GlobalAveragePooling3D produces one node per channel. Where the previous layer output is (frames, height, width, channels), it sums all the values across frames, height, and width for a particular channel and divides by (frames * height * width), so the total number of operations is as shown in (16).
Number of operations=(frames * height * width) * channels (16)
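As an illustration of this counting rule, here is a sketch for the three convolutional layers of the initial model, using (9) and (11); the helper is ours and ignores pooling, dense, and control operations:

```python
def conv3d_ops(frames, height, width, k, kernels_prev, kernels_curr):
    ops_per_node = (k ** 3 + 1) ** 2 * kernels_prev     # (9), cubic k*k*k kernel
    out_nodes = frames * height * width * kernels_curr  # (11), zero-padding, stride 1
    return ops_per_node * out_nodes

total = (conv3d_ops(30, 40, 40, 3, 1, 64)       # first Conv3D layer
         + conv3d_ops(15, 20, 20, 3, 64, 128)   # second Conv3D layer
         + conv3d_ops(7, 10, 10, 3, 128, 256))  # third Conv3D layer
print(f'~{total:.2e} operations for the convolutional part alone')
```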
Table 3 shows the number of operations for each proposed model according to its figure. The model with the fewest operations is also the one with the fewest parameters, and it has a high accuracy of 92.92%.
Table 3. Number of operations for each model according to its figure
No. of figure No. of operations
Figure 3 1.08 * 10^13
Figures 4, 5 5.77 * 10^12
Figure 6 4.77 * 10^12
4. DISCUSSION
Each result is discussed according to its figure number.
Figure 3 shows the model architecture, confusion matrix, accuracy, and loss curves for the earlier proposed model. Because the layer before the flatten layer was not small, the fully connected layer carries a massive number of parameters; we can also see that the model's learning saturated early, completing within 30 epochs. Figure 4 shows the model architecture, confusion matrix, accuracy, and loss curves for the model after adding a Conv3D and MaxPooling layer before the flatten layer to reduce the number of parameters and the training time; training again finished within 30 epochs. Figure 5 shows the same for the model after adding dropouts after the fourth, sixth, and eighth layers; the accuracy of this model shows a remarkable improvement, and training finished at epoch 100, the final requested epoch. Figure 6 shows the model after replacing the flatten layer with GlobalAveragePooling3D, which reduced the number of parameters and therefore the training time per epoch. It also reduced the number of operations during testing and reached an accuracy of 92.92%; training finished at epoch 82.
Layer (type) Output shape Parameters
Conv3D 30, 40, 40, 64 1792
MaxPooling3D 15, 20, 20, 64 0
Conv3D 15, 20, 20, 128 221312
MaxPooling3D 7, 10, 10, 128 0
Conv3D 7, 10, 10, 256 884992
MaxPooling3D 3, 5, 5, 256 0
Flatten 19200 0
Dense 128 2457728
Dense 32 4128
Dense 6 198
Total Parameters 3,570,150
(a) (b)
Figure 3. Earlier proposed model; (a) model architecture, (b) confusion matrix
(c) (d)
Figure 3. Earlier proposed model; (c) accuracy of training and validation, (d) losses of training and validation (continued)
Layer (type) Output shape Parameters
Conv3D 30, 40, 40, 64 1792
MaxPooling3D 15, 20, 20, 64 0
Conv3D 15, 20, 20, 128 221312
MaxPooling3D 7, 10, 10, 128 0
Conv3D 7, 10, 10, 256 884992
MaxPooling3D 3, 5, 5, 256 0
Conv3D 3, 5, 5, 256 1769728
MaxPooling3D 1, 2, 2, 256 0
Flatten 1024 0
Dense 128 131200
Dense 32 4128
Dense 6 198
Total Parameters 3,013,350
(a) (b)
(c) (d)
Figure 4. Adding a fourth convolutional and max-pooling layer; (a) model architecture, (b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation
Layer (type) Output shape Parameters
Conv3D 30, 40, 40, 64 1792
MaxPooling3D 15, 20, 20, 64 0
Conv3D 15, 20, 20, 128 221312
MaxPooling3D 7, 10, 10, 128 0
Dropout (0.5) 7, 10, 10, 128 0
Conv3D 7, 10, 10, 256 884992
MaxPooling3D 3, 5, 5, 256 0
Dropout (0.5) 3, 5, 5, 256 0
Conv3D 3, 5, 5, 256 1769728
MaxPooling3D 1, 2, 2, 256 0
Dropout (0.5) 1, 2, 2, 256 0
Flatten 1024 0
Dense 128 131200
Dense 32 4128
Dense 6 198
Total Parameters 3,013,350
(a) (b)
(c) (d)
Figure 5. Dropouts before the third and fourth convolutional layers and before the flatten layer; (a) model architecture, (b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation
Layer (type) Output shape Parameters
Conv3D 30, 40, 40, 64 1792
MaxPooling3D 15, 20, 20, 64 0
Conv3D 15, 20, 20, 128 221312
MaxPooling3D 7, 10, 10, 128 0
Conv3D 7, 10, 10, 256 884992
MaxPooling3D 3, 5, 5, 256 0
Dropout (0.5) 3, 5, 5, 256 0
GlobalAveragePooling3D 256 0
Dropout (0.5) 256 0
Dense 128 32896
Dense 32 4128
Dense 6 198
Total parameters 1,145,318
(a) (b)
Figure 6. Replacing flatten with global average pooling; (a) model architecture, (b) confusion matrix
(c) (d)
Figure 6. Replacing flatten with global average pooling; (c) accuracy of training and validation, (d) losses of training and validation (continued)
We can see that dropout has a great benefit, but only when it is placed correctly. Several changes were made to the placement and number of the dropouts: accuracy increased when dropouts were added before the third convolutional layer and the flatten layer, but decreased slightly when one was added before the fourth convolutional layer. The dropouts were then moved to before and after the flatten layer, which gave the maximum accuracy. This improvement was then tested on the model with fewer layers, aiming to decrease the number of parameters, and the results were helpful. To complete the parameter reduction, the flatten layer was replaced by global average pooling, which reduced the model's parameters by about two million.
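For reference, here is a minimal Keras sketch of this final reduced model, with layer sizes from the Figure 6 table (the code is ours, not the authors' implementation):

```python
from tensorflow.keras import layers, models

final_model = models.Sequential([
    layers.Conv3D(64, (3, 3, 3), padding='same', activation='relu',
                  input_shape=(30, 40, 40, 1)),
    layers.MaxPooling3D((2, 2, 2)),
    layers.Conv3D(128, (3, 3, 3), padding='same', activation='relu'),
    layers.MaxPooling3D((2, 2, 2)),
    layers.Conv3D(256, (3, 3, 3), padding='same', activation='relu'),
    layers.MaxPooling3D((2, 2, 2)),
    layers.Dropout(0.5),
    layers.GlobalAveragePooling3D(),  # replaces Flatten, cutting the dense input to 256
    layers.Dropout(0.5),
    layers.Dense(128, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(6, activation='softmax'),
])
final_model.summary()  # 1,145,318 trainable parameters, matching the Figure 6 table
```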
5. CONCLUSION
We have designed a model that can be used for online human activity recognition on a moderate computation machine. The accuracy of our proposed model reached 93.33%, and 92.92% for the model with a reduced number of parameters. The last presented model is well suited to machines with moderate computation capabilities, owing to its low number of parameters and low number of mathematical operations. We reached this high accuracy by exploiting dropouts and by decreasing the learning rate during training when there was no improvement. The model with a low number of mathematical operations could be used for online human activity recognition in smart houses, helping to monitor human activities in the home. We intend to apply more data augmentation to increase the overall accuracy, since only flipping augmentation has been applied to the data so far.
REFERENCES
[1] X. Zhou, W. Liang, K. I. Wang, H. Wang, L. T. Yang and Q. Jin, "Deep-Learning-Enhanced Human Activity
Recognition for Internet of Healthcare Things," in IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6429-6438,
July 2020, doi: 10.1109/JIOT.2020.2985082.
[2] V. Bianchi, M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini and I. De Munari, "IoT Wearable Sensor and
Deep Learning: An Integrated Approach for Personalized Human Activity Recognition in a Smart Home
Environment," in IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8553-8562, Oct. 2019, doi:
10.1109/JIOT.2019.2920283.
[3] A. K. M. Masum, A. Barua, E. H. Bahadur, M. R. Alam, M. A. U. Z. Chowdhury and M. S. Alam, "Human
Activity Recognition Using Multiple Smartphone Sensors," 2018 International Conference on Innovations in
Science, Engineering and Technology (ICISET), 2018, pp. 468-473, doi: 10.1109/ICISET.2018.8745628.
[4] M. M. Hassan, M. Z. Uddin, A. Mohamed and A. Almogren, “A robust human activity recognition system using
smartphone sensors and deep learning,” Future Generation Computer Systems, vol. 81, pp. 307-313, 2018, doi:
10.1016/j.future.2017.11.029.
[5] X. Wu, Z. Chu, P. Yang, C. Xiang, X. Zheng and W. Huang, "TW-See: Human Activity Recognition Through the
Wall With Commodity Wi-Fi Devices," in IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 306-
319, Jan. 2019, doi: 10.1109/TVT.2018.2878754.
[6] A. Diba et al., “Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification,”
Computer Science, 2017.
[7] T. Lima, B. Fernandes and P. Barros, "Human action recognition with 3D convolutional neural network," 2017
IEEE Latin American Conference on Computational Intelligence (LA-CCI), 2017, pp. 1-6, doi: 10.1109/LA-
CCI.2017.8285700.
[8] J. Carreira and A. Zisserman, “Quo Vadis, action recognition? A new model and the kinetics dataset,” Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2017-Janua, pp. 4724–4733,
2017.
[9] R. Singh, A. K. S. Kushwaha and R. Srivastava, “Multi-view recognition system for human activity based on
multiple features for video surveillance system,” Multimedia Tools and Applications, vol. 78, no. 12, pp. 17165-
17196, 2019, doi: 10.1007/s11042-018-7108-9.
[10] H. D. Mehr and H. Polat, "Human Activity Recognition in Smart Home With Deep Learning Approach," 2019 7th
International Istanbul Smart Grids and Cities Congress and Fair (ICSG), 2019, pp. 149-153, doi:
10.1109/SGCF.2019.8782290.
[11] Z. Tu et al., “Multi-stream CNN: Learning representations based on human-related regions for action recognition,”
Pattern Recognition, vol. 79, pp. 32-43, 2018, doi: 10.1016/j.patcog.2018.01.020.
[12] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition
unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980, doi:
10.1007/BF00344251.
[13] J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image
database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255, doi:
10.1109/CVPR.2009.5206848.
[14] A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,”
Advances in neural information processing systems, vol. 60, no. 6, pp. 84-90, 2017, doi: 10.1145/3065386.
[15] “KTH dataset,” 2005. [Online]. Available: https://www.csc.kth.se/cvap/actions/. [Accessed: 27-May-2020].
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-Scale Video Classification
with Convolutional Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014,
pp. 1725-1732, doi: 10.1109/CVPR.2014.223.
[17] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in
Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 1, no. 1, pp. 568-
576, 2014, doi: 10.5555/2968826.2968890.
[18] S. Ji, W. Xu, M. Yang and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, Jan. 2013, doi:
10.1109/TPAMI.2012.59.
[19] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, “Learning spatiotemporal features with 3D
convolutional networks,” Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2015
Inter, pp. 4489-4497, 2015.
[20] K. Soomro, A. R. Zamir and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From Videos in The
Wild,” Computer Vision and Pattern Recognition, no. November, 2012.
[21] M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” Distributed,
Parallel, and Cluster Computing, 2016.
[22] F. Chollet, “Keras,” 2015. [Online]. Available: https://keras.io/. [Accessed: 08-Jun-2020].
[23] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” Computer Science, Mathematics, pp. 1-
15, 2015.
[24] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,”
Information Processing & Management, vol. 45, no. 4, pp. 427-437, 2009, doi: 10.1016/j.ipm.2009.03.002.
[25] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,”
32nd Conference on Neural Information Processing Systems (NeurIPS), 2018, vol. 2018-Decem, no. NeurIPS, pp.
8778-8788, doi: 10.5555/3327546.3327555.
[26] M. Ahmad and S. W. Lee, “Human action recognition using shape and CLG-motion flow from multi-view image
sequences,” Pattern Recognition, vol. 41, no. 7, pp. 2237-2252, 2008, doi: 10.1016/j.patcog.2007.12.008.
[27] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, “Convolutional learning of spatio-temporal features,”
European conference on computer vision, Springer, Berlin, Heidelberg, 2010, vol. 6316 LNCS, no. PART 6, pp.
140-153, doi: 10.1007/978-3-642-15567-3_11.
[28] H. Qian, Y. Mao, W. Xiang and Z. Wang, “Recognition of human activities using SVM multi-class classifier,”
Pattern Recognition Letters, vol. 31, no. 2, pp. 100-111, 2010, doi: 10.1016/j.patrec.2009.09.019.
[29] I. Vasilev, D. Slater, G. Spacagna, P. Roelants and V. Zocca, "Python Deep Learning," 2nd ed. Birmingham:
Packt Publishing, 2019.
BIOGRAPHIES OF AUTHORS
Malik Alsaedi is an Asst. Prof. of electrical engineering. He received his B.Sc. degree from the University of Technology, Baghdad, his M.Tech. degree from JNTU University, India, and his Ph.D. degree from UTM University, Malaysia. He is currently deputy dean of the College of Engineering, Al-Iraqia University, Iraq. He is interested in optical communication and IoT technology.
Abdulrahman S. Mohialdeen holds a Bachelor's degree in Electrical Engineering from the University of Baghdad and a Master's degree in Computer Engineering from Al-Iraqia University. His research interests are deep learning, human activity recognition, and computer vision.
Baraa Munqith Albaker received both B.Sc. degree in electrical engineering and
M.Sc. degree in computer and control engineering from University of Baghdad, Iraq, and
Ph.D. degree in control engineering from University of Malaya, Malaysia. He had worked in
industry on data acquisition systems and radar signal processing and analysis for over three
years. He was a lecturer at the University of Baghdad for four years, then a senior lecturer at the UMPEDAC Research Centre, University of Malaya, for two years. Currently, he works as
head of Networks Engineering department at Al-Iraqia University. His research interests focus
on contemporary development in computer and control applications.

More Related Content

PDF
M017427985
PDF
O017429398
PDF
MULTIPLE RECONSTRUCTION COMPRESSION FRAMEWORK BASED ON PNG IMAGE
PDF
Image Compression and Reconstruction Using Artificial Neural Network
PDF
40120140507006
PDF
Image compression and reconstruction using a new approach by artificial neura...
PDF
J017426467
PDF
538 207-219
M017427985
O017429398
MULTIPLE RECONSTRUCTION COMPRESSION FRAMEWORK BASED ON PNG IMAGE
Image Compression and Reconstruction Using Artificial Neural Network
40120140507006
Image compression and reconstruction using a new approach by artificial neura...
J017426467
538 207-219

What's hot (16)

PDF
I017425763
PDF
An Efficient Analysis of Wavelet Techniques on Image Compression in MRI Images
PDF
IRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
PDF
IMAGE COMPRESSION AND DECOMPRESSION SYSTEM
DOCX
Thesis on Image compression by Manish Myst
PDF
H0114857
PDF
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
PDF
A Review of Comparison Techniques of Image Steganography
PDF
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
PDF
A ROBUST CHAOTIC AND FAST WALSH TRANSFORM ENCRYPTION FOR GRAY SCALE BIOMEDICA...
PDF
A new image steganography algorithm based
PPT
Applying Deep Learning with Weak and Noisy labels
PPTX
Image recognition
PDF
USING BIAS OPTIMIAZATION FOR REVERSIBLE DATA HIDING USING IMAGE INTERPOLATION
PDF
Fuzzy Type Image Fusion Using SPIHT Image Compression Technique
PDF
“Applying the Right Deep Learning Model with the Right Data for Your Applicat...
I017425763
An Efficient Analysis of Wavelet Techniques on Image Compression in MRI Images
IRJET- Handwritten Decimal Image Compression using Deep Stacked Autoencoder
IMAGE COMPRESSION AND DECOMPRESSION SYSTEM
Thesis on Image compression by Manish Myst
H0114857
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
A Review of Comparison Techniques of Image Steganography
NEURAL NETWORKS FOR HIGH PERFORMANCE TIME-DELAY ESTIMATION AND ACOUSTIC SOURC...
A ROBUST CHAOTIC AND FAST WALSH TRANSFORM ENCRYPTION FOR GRAY SCALE BIOMEDICA...
A new image steganography algorithm based
Applying Deep Learning with Weak and Noisy labels
Image recognition
USING BIAS OPTIMIAZATION FOR REVERSIBLE DATA HIDING USING IMAGE INTERPOLATION
Fuzzy Type Image Fusion Using SPIHT Image Compression Technique
“Applying the Right Deep Learning Model with the Right Data for Your Applicat...
Ad

Similar to Development of 3D convolutional neural network to recognize human activities using moderate computation machine (20)

PDF
Activity recognition based on spatio-temporal features with transfer learning
PDF
Human Action Recognition Using Deep Learning
PDF
Human Activity Recognition System
PDF
Crime Detection using Machine Learning
PDF
Human Activity Recognition
PDF
Detection of abnormal human behavior using deep learning
PDF
A Intensified Approach on Deep Neural Networks for Human Activity Recognition...
PDF
Human Action Recognition using Contour History Images and Neural Networks Cla...
PDF
Human Activity Recognition Using Neural Network
PDF
Video Classification: Human Action Recognition on HMDB-51 dataset
PDF
Human activity recognition with self-attention
PPTX
Human activity recognition updated 1 - Copy.pptx
PPTX
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
PDF
Human Action Recognition in Videos
PDF
IRJET - Creating a Security Alert for the Care Takers Implementing a Vast Dee...
PDF
Human Activity Recognition Using AccelerometerData
PDF
Attention correlated appearance and motion feature followed temporal learning...
PDF
improving Profile detection using Deep Learning
DOCX
Abstract.docx
PDF
Intelligent Video Surveillance System using Deep Learning
Activity recognition based on spatio-temporal features with transfer learning
Human Action Recognition Using Deep Learning
Human Activity Recognition System
Crime Detection using Machine Learning
Human Activity Recognition
Detection of abnormal human behavior using deep learning
A Intensified Approach on Deep Neural Networks for Human Activity Recognition...
Human Action Recognition using Contour History Images and Neural Networks Cla...
Human Activity Recognition Using Neural Network
Video Classification: Human Action Recognition on HMDB-51 dataset
Human activity recognition with self-attention
Human activity recognition updated 1 - Copy.pptx
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Human Action Recognition in Videos
IRJET - Creating a Security Alert for the Care Takers Implementing a Vast Dee...
Human Activity Recognition Using AccelerometerData
Attention correlated appearance and motion feature followed temporal learning...
improving Profile detection using Deep Learning
Abstract.docx
Intelligent Video Surveillance System using Deep Learning
Ad

More from journalBEEI (20)

PDF
Square transposition: an approach to the transposition process in block cipher
PDF
Hyper-parameter optimization of convolutional neural network based on particl...
PDF
Supervised machine learning based liver disease prediction approach with LASS...
PDF
A secure and energy saving protocol for wireless sensor networks
PDF
Plant leaf identification system using convolutional neural network
PDF
Customized moodle-based learning management system for socially disadvantaged...
PDF
Understanding the role of individual learner in adaptive and personalized e-l...
PDF
Prototype mobile contactless transaction system in traditional markets to sup...
PDF
Wireless HART stack using multiprocessor technique with laxity algorithm
PDF
Implementation of double-layer loaded on octagon microstrip yagi antenna
PDF
The calculation of the field of an antenna located near the human head
PDF
Exact secure outage probability performance of uplinkdownlink multiple access...
PDF
Design of a dual-band antenna for energy harvesting application
PDF
Transforming data-centric eXtensible markup language into relational database...
PDF
Key performance requirement of future next wireless networks (6G)
PDF
Noise resistance territorial intensity-based optical flow using inverse confi...
PDF
Modeling climate phenomenon with software grids analysis and display system i...
PDF
An approach of re-organizing input dataset to enhance the quality of emotion ...
PDF
Parking detection system using background subtraction and HSV color segmentation
PDF
Quality of service performances of video and voice transmission in universal ...
Square transposition: an approach to the transposition process in block cipher
Hyper-parameter optimization of convolutional neural network based on particl...
Supervised machine learning based liver disease prediction approach with LASS...
A secure and energy saving protocol for wireless sensor networks
Plant leaf identification system using convolutional neural network
Customized moodle-based learning management system for socially disadvantaged...
Understanding the role of individual learner in adaptive and personalized e-l...
Prototype mobile contactless transaction system in traditional markets to sup...
Wireless HART stack using multiprocessor technique with laxity algorithm
Implementation of double-layer loaded on octagon microstrip yagi antenna
The calculation of the field of an antenna located near the human head
Exact secure outage probability performance of uplinkdownlink multiple access...
Design of a dual-band antenna for energy harvesting application
Transforming data-centric eXtensible markup language into relational database...
Key performance requirement of future next wireless networks (6G)
Noise resistance territorial intensity-based optical flow using inverse confi...
Modeling climate phenomenon with software grids analysis and display system i...
An approach of re-organizing input dataset to enhance the quality of emotion ...
Parking detection system using background subtraction and HSV color segmentation
Quality of service performances of video and voice transmission in universal ...

Recently uploaded (20)

PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
Construction Project Organization Group 2.pptx
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPT
Mechanical Engineering MATERIALS Selection
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Sustainable Sites - Green Building Construction
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PDF
Well-logging-methods_new................
PPTX
Welding lecture in detail for understanding
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Construction Project Organization Group 2.pptx
Operating System & Kernel Study Guide-1 - converted.pdf
Mechanical Engineering MATERIALS Selection
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
UNIT 4 Total Quality Management .pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Model Code of Practice - Construction Work - 21102022 .pdf
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Sustainable Sites - Green Building Construction
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Automation-in-Manufacturing-Chapter-Introduction.pdf
bas. eng. economics group 4 presentation 1.pptx
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Well-logging-methods_new................
Welding lecture in detail for understanding
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk

Development of 3D convolutional neural network to recognize human activities using moderate computation machine

  • 1. Bulletin of Electrical Engineering and Informatics Vol. 10, No. 6, December 2021, pp. 3137~3146 ISSN: 2302-9285, DOI: 10.11591/eei.v10i6.2802 3137 Journal homepage: http://beei.org Development of 3D convolutional neural network to recognize human activities using moderate computation machine Malik A. Alsaedi, Abdulrahman S. Mohialdeen, Baraa M. Albaker College of Engineering, Al-Iraqia University, Sabe’ Abkar, Adhamiya, Baghdad, Iraq Article Info ABSTRACT Article history: Received Jan 12, 2021 Revised May 20, 2021 Accepted Oct 9, 2021 Human activity recognition (HAR) is recently used in numerous applications including smart homes to monitor human behavior, automate homes according to human activities, entertainment, falling detection, violence detection, and people care. Vision-based recognition is the most powerful method widely used in HAR systems implementation due to its characteristics in recognizing complex human activities. This paper addresses the design of a 3D convolutional neural network (3D-CNN) model that can be used in smart homes to identify several numbers of activities. The model is trained using KTH dataset that contains activities like (walking, running, jogging, handwaving handclapping, boxing). Despite the challenges of this method due to the effectiveness of the lamination, background variation, and human body variety, the proposed model reached an accuracy of 93.33%. The model was implemented, trained and tested using moderate computation machine and the results show that the proposal was successfully capable to recognize human activities with reasonable computations. Keywords: 3D-CNN Convolutional neural network Deep learning HAR Smart home Vision This is an open access article under the CC BY-SA license. Corresponding Author: Abdulrahman S. Mohialdeen College of Engineering Al-Iraqia University Sabe’ Abkar, Adhamiya, Baghdad, Iraq Email: abd_saeed@aliraqia.edu.iq 1. INTRODUCTION HAR is one of the challenging subjects because of the huge number of human activities, some of the activities can be easily noticed some of them are confusing, and some of them require interaction with other objects or humans, besides the diversity of the activities, the recognition methods are also diverse. There are many types of data required to recognize a human activity, some of them use ambient sensors like accelerometer, gyroscope, humidity, and temperature [1], [2]. Some of them get the benefit of the smartphone's sensors like accelerometer and gyroscope [3], [4], the other use the radio frequency [5]. But the most popular recognizing methods use vision-based recognition [6]-[14]. Vision uses images or videos to recognize the activity. Also, there is a lot of challenges for recognizing human activities because of the effect of lamination, variance of background. Still, the question is how to process these visual data to recognize the activity. The answer there are many techniques most of them use machine learning, and deep learning has shown an excellent benefit for recognizing human activities, especially the CNN which are very useful for vision-based data recognition. In this paper, we will design a neural network architect a.k.a. model, that can be used for human activity monitoring, the model proposed of 3D dimensional CNN (3D-CNN), and the purpose of using 3D-CNN is to extract spatial and temporal features rather than only spatial features, and the activity consists
  • 2.  ISSN: 2302-9285 Bulletin of Electr Eng & Inf, Vol. 10, No. 6, December 2021 : 3137 – 3146 3138 of multiple movements that can be identified by extracting the temporal features from between several numbers of frames. Our proposed model is a small number of 3D-CNN layers to reduce the amount of processing time so that any computer with low computational ability could recognize human activity online without delay. CNN was introduced by Fukushima [12]. At the beginning, it was mainly used for image classification, and image features extraction, one of the most attractive models that been invented to get into a challenge to classify images [13], [14]. During that many video human activity datasets were published and KTH [15] is one of the most popular small size datasets. CNN architect encouraged researchers to use it in HAR with image data taken from video dataset [10], [16]. Other researchers proposed to use two-stream CNN where the first stream is an image data and the second it's optical flow [16], [17] 3D-CNN for HAR most cited articles [18] proposed model with TRECVID dataset, and [19] proposed 3D-CNN model for UCF-101 dataset [20] in which our model inspired by their model. 2. RESEARCH METHOD This paper proposing to use deep neural networks (DNN) models that consists of many different layers, and each layer has its purpose during training and testing. Still, the dominant layer in which the article focused on is the Convolutional neural network, which is one of the most widely used neural networks, and its central idea is applying filters to the input data or convolute it, and transfer the convoluted data to the next layer. CNN was primarily used to deal with image data. Now there are 1D, 2D, and 3D CNN to get the benefit of this architect for another type of data with a different number of dimensions. 3D-CNN used for three-dimensional data which is very suited for our project, because we are dealing with video data, and the reason for using video data rather than image data is the activity made of several consequential movements of body parts. This continuous movement can be noticed with successive images, and this is a video. Pooling layers used in this paper are max-pooling, which returns the maximum value within a kernel size when it wraps around the data, average-pooling which returns the average amount within a kernel size that wraps along with the data, and global-average-pooling, which returns the average value of each dimension or each kernel in the CNN layer, and that is why it is useful at the final part of the model to reduce the number of parameters, and to help the model overcome overfitting dropout layers are used. 2.1. Proposed models The first suggested model shown in Figure 1 is influenced by the model proposed by Tran et al. [19]. The model consists of three connected 3D convolutional neural networks with 3*3*3 kernel size for all convolutional layers, 6ed by a max-pooling layer with a kernel size of 2*2*2 for all max-pooling layers. Zero-padding is used for all convolutional layers, and ‘ReLU’ activation function added after each convolutional and fully connected (FC) layers except for the last was SoftMax. Figure 1. Initial proposed model In the second attempt, other convolutional and max-pooling layers added before flatten layer, to increase the accuracy. Then dropouts were added in different places in the model with five attempts to get high accuracy, where all dropouts were with a 0.5 percentage factor. 
We try with our model to reduce the number of parameters (weights and biases) and return to the first model with dropouts. The accuracy increased using the dropouts. But, the number of parameters in huge, because the output of the layer before flatten layer was more extensive than before removing convolutional and pooling layers, and as the pooling layer is gone the number of parameters increased, in (1)-(3) shows how the number of parameters calculated. Input 30*40*40*1 Conv3D 64 MaxPool Conv3D 128 MaxPool Conv3D 256 MaxPool Flatten FC 128 FC 32 FC 6 SoftMax Output 6 classes
  • 3. Bulletin of Electr Eng & Inf ISSN: 2302-9285  Development of 3D convolutional neural network to recognize human … (Malik A. Alsaedi) 3139 So, to solve this problem GlobalAveragePooling3D layer replaced the flatten layer, this replacement reduced the number of parameters to about two million parameters, which is very helpful for low or moderate computation capabilities machines to deal with online activity recognition. No. of parameters in CNN=(filterwidth * filterheight * filterdepth + 1) * no. of filters (1) No. of parameters in FC net=neurons of current layer * neurons of previous layer (2) No. of parameters in flatten=multiplication of all previous layer dimension (3) 3. RESULTS AND DISCUSSION The neural network model is trained on the KTH dataset, the dataset is doubled before using it by adding a flipped copy of it, 20% of data is taken for test after training, and another 20% taken for the test during the training, and 60% were taken for the training process. The model was trained using Tensorflow- v2.1 [21] as backend, and Keras-v2.3.1 [22] using Python-language-v3.7.5 as front-end, with batch size=16 and the shape of the frame taken from the video were 40x40 pixels with one channel (grayscale), from each video 30 frames were taken between each frame and another there were four frames between taken frames discarded. The optimizer of the model was Adam optimizer [23], and with a learning rate of 0.001 and categorical cross-entropy loss function, the machine specification was: HP 15 Notebook, Memory: 12288MB RAM, Intel-Core i5-3230M processor with four cores, two of them are physical, and the maximum frequency is 2.6GHz and Windows 8.1 Enterprise 64-bit OS (6.3, build 9600). Validation data controls the training operation. So, if an update made to the model and validation data applied to the model and the losses did not improved for three epochs for the learning rate would be multiplied by a half and the minimum reduction is 0.0001. If the losses did not improve for 15 epochs consequently, the training would be finished before getting to the given number of epochs is 100. 3.1. Calculating results After training operation finishes test samples are pushed to the model to get the response the accuracy, precision, recall, and f1_score are calculated using (4)-(7) which uses confusion matrix shown in Figure 2 [24], for average loss is calculated using categorical cross-entropy algorithm [25]. Figure 2. Confusion matrix annotation Accuracy= ∑ tpi l i=1 +tni ∑ (tpi+fni+fpi+tni ) l i=1 (4) In (5) shows the way to calculate accuracy which defines the effectiveness of the model overall. Precision= ∑ tpi l i=1 ∑ (tpi+fpi) l i=1 (5) According to (5) shows the way to calculate precision which determines the matching between the label of classes and the calculated labels. In (6) shows the way to calculate recall which demonstrates the effectiveness of the model to identify the label of classes. As shown in (7) shows the way to calculate F1_score which defines the relation between output data taken from the model after entering data for test and the positive labels. In (8) show the way to calculate the average loss, where N is the number of samples, M is the number of classes, d is the true label or desired output, and y is the calculated or tested output from the model. Table 1 shows the calculated results, and it’s figure number. 
Table 2 shows a comparison of accuracies for several studies done on the KTH dataset for human activity recognition and our study accuracy, and it is evident that our method has shown a remarkable improvement according to accuracy.
  • 4.  ISSN: 2302-9285 Bulletin of Electr Eng & Inf, Vol. 10, No. 6, December 2021 : 3137 – 3146 3140 Recall= ∑ tpi l i=1 ∑ (tpi+fni) l i=1 (6) F1_score= (β2+1)∗Precision+Recall β2∗Precision+Recall (8) Loss= ∑ − 1 M ∑ di,jlog⁡(yi,j) M j=1 N i=1 N (9) Table 1. Calculated results for all models No. No. of figure Accuracy % Loss Precision % Recall % F1_score % 1. Figure 3 85.83 0.39 86.30 85.89 86.09 2. Figure 4 88.75 0.39 88.97 88.97 88.97 3. Figure 5 92.08 0.23 92.67 91.90 92.28 4. Figure 6 92.92 0.22 93.08 92.95 93.02 Table 2. Comparison of accuracies of researches done on KTH No. Method Accuracy % 1 Ahmad and Lee [26] 84.83 2 Taylor et al. [27] 88.00 3 Qian et al. [28] 88.69 4 Our method 93.33 Number the table consecutively according to the first mention (sequential order). 3.2. Calculating the number of operations for layers We want to calculate an approximate number of operations for each layer, the calculations don’t include controlling operations, calculations are detailed below: a. 3D-CNN each kernel convolutes on the entire input data, and for 3D-CNN with a kernel size of (Kd, Kh, Kw), input data of (frames, height, width) and strides are (strided, strideh, stridew), we would have: No. of operations per node=(Kd * Kh * Kw + 1)2 * no. of previous kernels (9) Output nodes=((frames-Kd)/strided+1) * (height-Kh)/strideh+1) *((width- Kw)/stridew+1) *no. of current kernels (10) Output nodes=(frames/strided+1) * (height/strideh+1) * (width/stridew+1) * no. of current kernels (11) In (9) shows the number of operations for each output node came from the convolution operation and the power two because we have multiplications, in (10) [29] shows the number of output node and for each output node we have. For 3D-CNN layer, but for no zero paddings, if we use padding which is used in our proposed model the number of operation would be as shown in (11) [29]. b. Maxpooling3D performs comparison operation, for a pool window size of (Pd, Ph, Pw), input data of (frames, height, width) and strides are (strided, strideh, stridew), we would have: No. of operations per node=Pd * Ph * Pw (12) Output nodes=((frames-Pd)/strided+1) * ((height-Ph)/strideh+1) * ((width- Pw)/stridew+1) no. of current kernels (13) Output nodes=(frames/strided+1) * (height/strideh+1) * (width/stridew+1) * current kernels (14) In (12) shows the number of operations for each pool window, in (13) shows the number of output node without padding which is used in our model and the size of the stride are the same pool window size, in (14) shows the number of output nodes if there are zero paddings. c. Fully connected has a vast number of parameters as compared with CNN if the number of neurons of the previous layer is Nprevious and the number of neurons of the current layer is Ncurrent. In (15) shows the number of operations for a fully connected layer, power two because we have multiplications.
  • 5. Bulletin of Electr Eng & Inf ISSN: 2302-9285  Development of 3D convolutional neural network to recognize human … (Malik A. Alsaedi) 3141 Number of operations=(Nprevious * Ncurrent)2 (15) d. Flatten has only one operation which is reshaping the dimensions into one dimension. e. Dropout works only during training operation, and it’s just hidden randomly chosen a portion of nodes so as not to participate in producing the output at only some point in the training, so for testing it doesn’t cost any operations. f. GlobalAveragePooling3D adds nodes for each channel where the previous layer output is (frames, height, width, channels), so it adds all the numbers in frames, height, and width for a particular channel and divides by the number of (frames * height * width), so the total number of operations is shown in (16). Number of operations=(frames * height * width) * channels (16) Table 3 shows the number of operations for each model proposed according to their figures, and we can see the least number of operations is the model with the least number of parameters and high accuracy of 92.92%. Table 3. Number of operations for each model according to its figure No. of figure No. of operations Figures 3 1.08 * 1013 Figures 4, 5 5.77 * 1012 Figure 6 4.77 * 1012 4. DISCUSSION Each result is discussed according to the number of figures. Figure 4 shows model architect, confusion matrix, accuracy, and loss figures for the earlier proposed model. We can notice that because the last before flatten were not small we got a massive number of the parameters for the fully connected layer, also we can see that the model’s learning was saturated in early time which completed learning within 30 epochs. Figure 5 shows model architect, confusion matrix, accuracy and loss figures for the proposed model after adding Conv3D and MaxPooling before flatten to reduce the number of parameters, and the training time and the training finished within 30 epochs. Figure 6, shows model architect, confusion matrix, accuracy, and loss figures for the model after adding dropouts after the fourth, sixth, and eighth layers, the accuracy for this model shown remarkable improvement and the training finished in epoch 100 which is the final demanded epoch. Figure 6 shows model architect, confusion matrix, accuracy, and loss figures for the model after changing flatten layer with GlobalAveragePooling3D, which reduced the number of parameters and so the training time per epoch. It also reduced the number of operations during testing and got a fantastic accuracy of 92.92%, the training was finished in epoch 82. Layer (type) Output shape Parameters Conv3D 30, 40, 40, 64 1792 MaxPooling3D 15, 20, 20, 64 0 Conv3D 15, 20, 20, 128 221312 MaxPooling3D 7, 10, 10, 128 0 Conv3D 7, 10, 10, 256 884992 MaxPooling3D 3, 5, 5, 256 0 Flatten 19200 19200 Dense 128 2457728 Dense 32 4128 Dense 6 198 Total Parameters 3,570,150 (a) (b) Figure 3. Earlier proposed model; (a) model architecture, (b) confusion matrix
  • 6.  ISSN: 2302-9285 Bulletin of Electr Eng & Inf, Vol. 10, No. 6, December 2021 : 3137 – 3146 3142 (c) (d) Figure 4. Earlier proposed model; (c) accuracy of training and validation, (d) losses of training and validation (continue) Layer (type) Output shape Parameters Conv3D 30, 40, 40, 64 1792 MaxPooling3D 15, 20, 20, 64 0 Conv3D 15, 20, 20, 128 221312 MaxPooling3D 7, 10, 10, 128 0 Conv3D 7, 10, 10, 256 884992 MaxPooling3D 3, 5, 5, 256 0 Conv3D 3, 5, 5, 256 1769728 MaxPooling3D 1, 2, 2, 256 0 Flatten 1024 0 Dense 128 131200 Dense 32 4128 Dense 6 198 Total Parameters 3,013,350 (a) (b) (c) (d) Figure 5. Adding fourth convolutional and max-pooling layers; (a) model architecture, (b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation
Layer (type)       Output shape       Parameters
Conv3D             30, 40, 40, 64     1792
MaxPooling3D       15, 20, 20, 64     0
Conv3D             15, 20, 20, 128    221312
MaxPooling3D       7, 10, 10, 128     0
Dropout (0.5)      7, 10, 10, 128     0
Conv3D             7, 10, 10, 256     884992
MaxPooling3D       3, 5, 5, 256       0
Dropout (0.5)      3, 5, 5, 256       0
Conv3D             3, 5, 5, 256       1769728
MaxPooling3D       1, 2, 2, 256       0
Dropout (0.5)      1, 2, 2, 256       0
Flatten            1024               0
Dense              128                131200
Dense              32                 4128
Dense              6                  198
Total parameters                      3,013,350

Figure 5. Dropouts before the third and fourth convolutional layers and before the flatten layer; (a) model architecture, (b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation

Layer (type)            Output shape       Parameters
Conv3D                  30, 40, 40, 64     1792
MaxPooling3D            15, 20, 20, 64     0
Conv3D                  15, 20, 20, 128    221312
MaxPooling3D            7, 10, 10, 128     0
Conv3D                  7, 10, 10, 256     884992
MaxPooling3D            3, 5, 5, 256       0
Dropout (0.5)           3, 5, 5, 256       0
GlobalAveragePooling3D  256                0
Dropout (0.5)           256                0
Dense                   128                32896
Dense                   32                 4128
Dense                   6                  198
Total parameters                           1,145,318

Figure 6. Replacing flatten with global average pooling; (a) model architecture, (b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation
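Under the same assumed kernel size, padding, and activations, the final reduced model of Figure 6 can be sketched as below, together with a training configuration matching what the paper reports: the Adam optimizer it cites [23], a cross-entropy loss [25], and a callback that lowers the learning rate when validation stops improving; the specific loss string, monitored quantity, factor, and patience are our assumptions.

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Sketch of the Figure 6 model: GlobalAveragePooling3D replaces Flatten,
# so the first dense layer sees only 256 inputs and needs 32,896 parameters.
model = models.Sequential([
    layers.Conv3D(64, 3, padding="same", activation="relu",
                  input_shape=(30, 40, 40, 1)),
    layers.MaxPooling3D(2),
    layers.Conv3D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling3D(2),
    layers.Conv3D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling3D(2),
    layers.Dropout(0.5),
    layers.GlobalAveragePooling3D(),   # one averaged value per channel -> 256
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(6, activation="softmax"),
])
model.summary()   # reports 1,145,318 trainable parameters, as in the table

model.compile(optimizer="adam",                  # Adam is cited in [23]
              loss="categorical_crossentropy",   # assumed; cross entropy is cited in [25]
              metrics=["accuracy"])
# Assumed callback settings: halve the learning rate after 5 stagnant epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                              patience=5, min_lr=1e-6)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])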
Dropout is of great benefit, but only when it is placed correctly. Several placements and counts of dropout layers were tried: accuracy increased when dropout was added before the third convolutional layer and before the flatten layer, but decreased slightly when it was also added before the fourth convolutional layer. The dropout layers were then moved to just before and after the flatten layer, which gave the maximum accuracy. This arrangement was next tested on a model with fewer layers, with the aim of decreasing the number of parameters, and the results were helpful. To complete the parameter reduction, the flatten layer was replaced by global average pooling, which reduced the model's parameter count by roughly two million.

5. CONCLUSION
We have designed a model that can be used for online human activity recognition on a moderate-computation machine. The accuracy of the proposed model reached 93.33%, and 92.92% for the variant with a reduced number of parameters. The latter model suits machines of moderate computational capability, owing to its low numbers of parameters and mathematical operations. This high accuracy was achieved by exploiting dropout and by decreasing the learning rate during training whenever there was no improvement. The model with the low number of mathematical operations could be used for online human activity recognition in smart houses, helping to monitor human activities there. We intend to apply further data augmentation to raise the overall accuracy, since only flipping augmentation has been applied so far.

REFERENCES
[1] X. Zhou, W. Liang, K. I. Wang, H. Wang, L. T. Yang and Q. Jin, "Deep-Learning-Enhanced Human Activity Recognition for Internet of Healthcare Things," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6429-6438, July 2020, doi: 10.1109/JIOT.2020.2985082.
[2] V. Bianchi, M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini and I. De Munari, "IoT Wearable Sensor and Deep Learning: An Integrated Approach for Personalized Human Activity Recognition in a Smart Home Environment," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8553-8562, Oct. 2019, doi: 10.1109/JIOT.2019.2920283.
[3] A. K. M. Masum, A. Barua, E. H. Bahadur, M. R. Alam, M. A. U. Z. Chowdhury and M. S. Alam, "Human Activity Recognition Using Multiple Smartphone Sensors," 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), 2018, pp. 468-473, doi: 10.1109/ICISET.2018.8745628.
[4] M. M. Hassan, M. Z. Uddin, A. Mohamed and A. Almogren, "A robust human activity recognition system using smartphone sensors and deep learning," Future Generation Computer Systems, vol. 81, pp. 307-313, 2018, doi: 10.1016/j.future.2017.11.029.
[5] X. Wu, Z. Chu, P. Yang, C. Xiang, X. Zheng and W. Huang, "TW-See: Human Activity Recognition Through the Wall With Commodity Wi-Fi Devices," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 306-319, Jan. 2019, doi: 10.1109/TVT.2018.2878754.
[6] A. Diba et al., "Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification," Computer Science, 2017.
[7] T. Lima, B. Fernandes and P. Barros, "Human action recognition with 3D convolutional neural network," 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI), 2017, pp. 1-6, doi: 10.1109/LA-CCI.2017.8285700.
[8] J. Carreira and A. Zisserman, "Quo Vadis, action recognition? A new model and the kinetics dataset," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724-4733, 2017.
[9] R. Singh, A. K. S. Kushwaha and R. Srivastava, "Multi-view recognition system for human activity based on multiple features for video surveillance system," Multimedia Tools and Applications, vol. 78, no. 12, pp. 17165-17196, 2019, doi: 10.1007/s11042-018-7108-9.
[10] H. D. Mehr and H. Polat, "Human Activity Recognition in Smart Home With Deep Learning Approach," 2019 7th International Istanbul Smart Grids and Cities Congress and Fair (ICSG), 2019, pp. 149-153, doi: 10.1109/SGCF.2019.8782290.
[11] Z. Tu et al., "Multi-stream CNN: Learning representations based on human-related regions for action recognition," Pattern Recognition, vol. 79, pp. 32-43, 2018, doi: 10.1016/j.patcog.2018.01.020.
[12] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980, doi: 10.1007/BF00344251.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
[14] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017, doi: 10.1145/3065386.
[15] "KTH dataset," 2005. [Online]. Available: https://www.csc.kth.se/cvap/actions/. [Accessed: 27-May-2020].
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-Scale Video Classification with Convolutional Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732, doi: 10.1109/CVPR.2014.223.
[17] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 1, pp. 568-576, 2014, doi: 10.5555/2968826.2968890.
[18] S. Ji, W. Xu, M. Yang and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, Jan. 2013, doi: 10.1109/TPAMI.2012.59.
[19] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489-4497, 2015.
[20] K. Soomro, A. R. Zamir and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild," Computer Vision and Pattern Recognition, 2012.
[21] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," Distributed, Parallel, and Cluster Computing, 2016.
[22] F. Chollet, "Keras," 2015. [Online]. Available: https://keras.io/. [Accessed: 08-Jun-2020].
[23] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," Computer Science, Mathematics, pp. 1-15, 2015.
[24] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427-437, 2009, doi: 10.1016/j.ipm.2009.03.002.
[25] Z. Zhang and M. R. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018, pp. 8778-8788, doi: 10.5555/3327546.3327555.
[26] M. Ahmad and S. W. Lee, "Human action recognition using shape and CLG-motion flow from multi-view image sequences," Pattern Recognition, vol. 41, no. 7, pp. 2237-2252, 2008, doi: 10.1016/j.patcog.2007.12.008.
[27] G. W. Taylor, R. Fergus, Y. LeCun and C. Bregler, "Convolutional learning of spatio-temporal features," European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010, pp. 140-153, doi: 10.1007/978-3-642-15567-3_11.
[28] H. Qian, Y. Mao, W. Xiang and Z. Wang, "Recognition of human activities using SVM multi-class classifier," Pattern Recognition Letters, vol. 31, no. 2, pp. 100-111, 2010, doi: 10.1016/j.patrec.2009.09.019.
[29] I. Vasilev, D. Slater, G. Spacagna, P. Roelants and V. Zocca, Python Deep Learning, 2nd Edition, Birmingham: Packt Publishing, 2019.
BIOGRAPHIES OF AUTHORS

Malik Alsaedi is an Asst. Prof. of electrical engineering. He received his B.Sc. degree from the University of Technology, Baghdad, his M.Tech. degree from JNTU, India, and his Ph.D. degree from UTM, Malaysia. He is currently deputy dean of the College of Engineering, Al-Iraqia University, Iraq. He is interested in optical communication and IoT technology.

Abdulrahman S. Mohialdeen holds a Bachelor's degree in Electrical Engineering from the University of Baghdad and a Master's degree in Computer Engineering from Al-Iraqia University. His research interests are deep learning, human activity recognition, and computer vision.

Baraa Munqith Albaker received his B.Sc. degree in electrical engineering and M.Sc. degree in computer and control engineering from the University of Baghdad, Iraq, and his Ph.D. degree in control engineering from the University of Malaya, Malaysia. He worked in industry on data acquisition systems and on radar signal processing and analysis for over three years. He was a lecturer at the University of Baghdad for four years, then a senior lecturer at the UMPEDAC Research Centre, University of Malaya, for two years. He currently heads the Networks Engineering department at Al-Iraqia University. His research interests focus on contemporary developments in computer and control applications.