TELKOMNIKA, Vol.17, No.6, December 2019, pp.3010~3018
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v17i6.12701
Received March 18, 2019; Revised July 2, 2019; Accepted July 18, 2019
Batik image retrieval using convolutional neural network
Heri Prasetyo*1, Berton Arie Putra Akardihas2
Department of Informatics, Universitas Sebelas Maret (UNS), Surakarta, Indonesia
*Corresponding author, e-mail: heri.prasetyo@staff.uns.ac.id1, bertonarie123@student.uns.ac.id2
Abstract
This paper presents a simple technique for Batik image retrieval using
the Convolutional Neural Network (CNN) approach. Two CNN models, one trained with supervised
learning and one with unsupervised learning, are considered to perform end-to-end feature
extraction that describes the content of a Batik image. Distance metrics measure the similarity
between the query and the target images in the database based on the features generated by
the CNN architecture. As reported in the experimental section, the proposed supervised CNN model
achieves better performance than the unsupervised CNN in the Batik image retrieval system.
In addition, the image feature composed from the proposed CNN model yields better performance
than the handcrafted feature descriptors. This demonstrates the superior performance of
the deep learning-based approach in the Batik image retrieval system.
Keywords: autoencoder, CNN, deep learning, feature extraction, image retrieval
Copyright © 2019 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
The brain is an amazing organ in the human body. With our brains, we can understand
what we see, smell, taste, hear, and touch. An infant's brain weighs only about half a kilogram,
yet it can solve problems that even supercomputers cannot. Within several months of birth,
a baby can recognize the faces of its parents, discern discrete objects from the background,
and begin to speak. Within one year, the baby has an intuition about natural objects, can follow
objects, and understands the meaning of a sound. As children, they can understand
grammar and have thousands of words in their vocabulary.
Building machines with intelligence like that of our brains is not easy. To make machines
with artificial intelligence, we have to solve very complex computing problems that we ourselves
have struggled with, problems that our brains can solve in a matter of seconds. To overcome
this, we have to develop ways of programming computers that differ from those used over
the past decade. This need has given rise to an active field of artificial intelligence
commonly called deep learning [1].
Artificial intelligence (AI) has recently undergone very rapid development and has been
used in many fields of research. In the field of computer vision, Content-Based Image Retrieval
(CBIR) has been developed in multi-level schemes ranging from low-level to high-level features.
The Convolutional Neural Network (CNN) has been used successfully as an effective feature
descriptor that yields accurate results. In general, the features obtained by deep learning
methods are trained to mimic human perception through various operations such as convolution
and pooling. Deep learning has become a better feature descriptor than low-level features.
Although the CNN is now the state of the art in computer vision, this does not
guarantee that the features obtained from the highest level always give the best performance [2].
A Content-Based Image Retrieval system aims to provide an effective way of browsing,
searching, and retrieving desired images that are stored in an image database.
The image database contains many images that have been stored and arranged in
a storage device. The image database is usually very large, so searching for specific
images manually takes a lot of time and is inconvenient for the user.
For example, Batik is a cultural heritage of the Indonesian archipelago
that has a high value and blend of art, laden with philosophical meanings
and meaningful symbols that show the way of thinking of the people who make it. Batik is a craft
that has long been part of Indonesian culture, especially Javanese culture. Batik has
many motifs, patterns, and colors, so retrieving a specific Batik image from the database is very
challenging [3].
This paper proposes using convolutional neural networks to carry out
the CBIR task and address the problems that occur in retrieving Batik images. The intended
method produces effective image descriptors from the CNN architecture. These feature descriptors
are very important for content-based retrieval systems. The image feature is used to improve
the performance of, and to solve problems in, existing Batik retrieval systems.
2. Content-based Image Retrieval System
Image retrieval is a computer system for searching and retrieving a specific image
from a large image database. The classical approach depends on metadata
such as texts, keywords, or descriptions attached to an image. Image retrieval
can then be performed using the aforementioned text or keywords as the search key. This technique
is inefficient since manual image annotation is a time-consuming and exhausting
process. Even though many automatic image annotation methods have been proposed in
the literature [4], an image retrieval system based on annotations still cannot deliver
satisfactory results.
CBIR is a computer application dealing with the search problem over large-scale
image databases. CBIR, also known as Query By Image Content (QBIC) and
Content-Based Visual Information Retrieval (CBVIR), differs from the metadata-based approach.
CBIR analyzes the image content rather than metadata such as image keywords,
tags, or image descriptions [5].
In this paper, the usability of the CNN model is extended to the CBIR task. The main reason
is the superior performance offered by the CNN model compared to handcrafted features in
computer vision and recognition tasks. CNN, or deep learning, networks achieve
outstanding performance in the ImageNet challenge [6]. The CNN model has inspired
other deep learning-based approaches, such as AlexNet [7], VGGNet [8], GoogLeNet [9],
Microsoft ResNet [10], etc., to overcome the limitations of handcrafted features in the image
retrieval domain.
The CNN model receives a three-dimensional image of size ℎ × 𝑤 × 𝑑, where ℎ and 𝑤
are the spatial dimensions and 𝑑 is the number of channels. This image is further processed
through the CNN architecture, which consists of several convolution, max-pooling, and activation
layers, to perform end-to-end image feature generation. Let 𝑋𝑖𝑗 be the data vector located at
spatial position (𝑖, 𝑗) in a specific layer. The CNN computes the new data 𝑌𝑖𝑗 as follows:
$$Y_{ij} = f_{ks}\left(\{X_{si+\delta_i,\; sj+\delta_j}\}_{0 \le \delta_i, \delta_j < k}\right) \tag{1}$$
where 𝑘 and 𝑠 denote the kernel size and stride, respectively. The function 𝑓𝑘𝑠 is determined by
the layer type: a matrix multiplication for convolutional layers, a spatial max for max-pooling
layers, an elementwise nonlinearity for activation functions, and so on for other types of layers.
This functional form is maintained under composition, with the kernel size and stride obeying
the transformation rule:
$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'} \tag{2}$$
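As a hedged illustration of equations (1) and (2), the following minimal sketch applies a window function with kernel size k and stride s to a 1-D signal, which is a simplification of the 2-D case. The helper names `apply_layer` and `composed_params` are hypothetical and do not come from the paper. The sketch checks, for max pooling only, that stacking two such layers gives the same output as a single layer with kernel size k′ + (k − 1)s′ and stride ss′.

```python
def apply_layer(x, f, k, s):
    """Apply window function f with kernel size k and stride s to a 1-D list x.

    Output element i is f(x[s*i : s*i + k]), the 1-D analogue of equation (1).
    No padding is used, so only windows that fit entirely are evaluated.
    """
    n_out = (len(x) - k) // s + 1
    return [f(x[s * i: s * i + k]) for i in range(n_out)]


def composed_params(k, s, k_prime, s_prime):
    """Kernel size and stride of f_{k,s} applied after g_{k',s'}, as in equation (2)."""
    return k_prime + (k - 1) * s_prime, s * s_prime


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

# Inner layer g: max pooling with k' = 2, s' = 2.
inner = apply_layer(signal, max, k=2, s=2)
# Outer layer f: max pooling with k = 3, s = 1, applied to the output of g.
stacked = apply_layer(inner, max, k=3, s=1)

# Equivalent single layer: window size k' + (k - 1) s' and stride s s'.
k_eq, s_eq = composed_params(k=3, s=1, k_prime=2, s_prime=2)
single = apply_layer(signal, max, k=k_eq, s=s_eq)

print(stacked, single, (k_eq, s_eq))
```

Max pooling is used here because the composition of two max operations is again a max, so the equality can be checked directly; for other layer types the rule describes the receptive field and stride of the composed layer rather than a single closed-form function.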
While a general network computes a general nonlinear function, a network with only
layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional
network (FCN). An FCN naturally operates on an input of any size and produces an output of
the corresponding spatial dimensions. A real-valued loss function composed with an FCN defines
a task. If the loss function is a sum over the spatial dimensions of the final layer,
𝑙(𝑥; 𝜃) = ∑𝑖𝑗 𝑙′(𝑥𝑖𝑗; 𝜃),
its parameter gradient is a sum over the parameter gradients of each of its spatial
components. Thus stochastic gradient descent on 𝑙 computed on whole images is the same as
stochastic gradient descent on 𝑙′, taking all of the final-layer receptive fields as a minibatch.
When these receptive fields overlap significantly, both the forward and backward propagation
are more efficient when computed layer by layer over the entire image than when
computed patch by patch. An illustration of a CNN operation can be
seen in Figure 1.
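The gradient identity behind this argument can be written out explicitly. This is only a restatement of the preceding paragraph using the linearity of differentiation, with the same symbols 𝑙, 𝑙′, 𝑥𝑖𝑗, and 𝜃, and is not an additional result:

```latex
l(x;\theta) = \sum_{i,j} l'(x_{ij};\theta)
\quad\Longrightarrow\quad
\nabla_{\theta}\, l(x;\theta) = \sum_{i,j} \nabla_{\theta}\, l'(x_{ij};\theta)
```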
The proposed CNN model constructs the feature descriptor from a Batik image. This
feature descriptor is used to measure the similarity between the query and the target images in
the database under the K-Nearest Neighbors (KNN) [11] strategy. The KNN technique performs
similarity matching with a distance score criterion. This paper investigates two CNN models in
the training stage, i.e. with supervised and unsupervised learning approaches. Figure 1
illustrates an example of the proposed supervised CNN architecture for Batik image retrieval.
The term supervised refers to the use of class labels, whereas the unsupervised approach
ignores the image labels in the training process. The autoencoder is a simple example of
an unsupervised CNN method; it compresses the data features into a smaller size and recovers
them back to the original data [12].
Figure 1. Illustration of the CNN operation
3. Method
This section presents two methods for generating the feature descriptor in the Batik
image retrieval system. We first explain the supervised CNN model. The unsupervised
Convolutional Auto-Encoder (CAE) model [13] is then described.
3.1. Supervised Learning
The CNN model is a supervised deep learning-based approach commonly used in
image classification [14], prediction [15], segmentation, analysis [16], etc. The supervised
CNN model consists of several layers such as convolutional layers, max-pooling layers, etc.
These layers are repeated several times and fed into the fully connected layers at the end of
the network [17]. Our proposed image retrieval system employs a CNN architecture with six
convolutional layers and two fully connected layers to generate the Batik feature descriptor.
Table 1 summarizes the CNN architecture used in our proposed method.
Table 1. The Supervised CNN Architecture for Batik Image Retrieval
| Layer Type | Size | Output Shape |
|---|---|---|
| Input | (128,128,3) | - |
| Convolutional + ReLU | 8 (3x3) filters, 1 stride, 2 padding | (128,128,8) |
| Max Pooling | 8 (2x2) filters, 2 stride, 0 padding | (64,64,8) |
| Convolutional + ReLU | 16 (3x3) filters, 1 stride, 2 padding | (64,64,16) |
| Max Pooling | 16 (2x2) filters, 2 stride, 0 padding | (32,32,16) |
| Convolutional + ReLU | 32 (3x3) filters, 1 stride, 2 padding | (32,32,32) |
| Max Pooling | 32 (2x2) filters, 2 stride, 0 padding | (16,16,32) |
| Convolutional + ReLU | 64 (3x3) filters, 1 stride, 2 padding | (16,16,64) |
| Max Pooling | 64 (2x2) filters, 2 stride, 0 padding | (8,8,64) |
| Convolutional + ReLU | 128 (3x3) filters, 1 stride, 2 padding | (8,8,128) |
| Max Pooling | 128 (2x2) filters, 2 stride, 0 padding | (4,4,128) |
| Convolutional + ReLU | 256 (3x3) filters, 1 stride, 2 padding | (4,4,256) |
| Max Pooling | 256 (2x2) filters, 2 stride, 0 padding | (2,2,256) |
| Flatten + Dropout (30%) | (1,1,1024), 1024 neurons | 256 |
| Dense | 256 neurons | 97 |
| Softmax | 97 way | 97 |
After six convolution and max-pooling operations, an input image of size
128 × 128 × 3 is converted into a new representation of dimensionality 2 × 2 × 256. This new
representation is then flattened into one-dimensional data of size 1 × 1 × 1024. The
flattened data is subsequently processed and trained with a Multi-Layer Perceptron (MLP).
Herein, the MLP receives 1024 input features and feeds them into 1024 input neurons. The hidden
and output layers have 256 and 97 neurons, respectively. The value of 97 in the output layer
equals the number of target classes, i.e. the number of Batik image classes used in the proposed
image retrieval system.
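The shape bookkeeping described above can be traced with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: it assumes that each 3x3 convolution is padded so that the spatial size is preserved (as the output shapes in Table 1 imply) and that each 2x2 max pooling with stride 2 halves the height and width. The helper names `conv_same` and `max_pool_2x2` are hypothetical.

```python
def conv_same(shape, filters):
    """3x3 convolution, stride 1, padded so that height and width are unchanged."""
    h, w, _ = shape
    return (h, w, filters)


def max_pool_2x2(shape):
    """2x2 max pooling with stride 2 and no padding: halves height and width."""
    h, w, d = shape
    return (h // 2, w // 2, d)


shape = (128, 128, 3)                      # input image size from Table 1
trace = [shape]
for filters in (8, 16, 32, 64, 128, 256):  # filter counts of the six conv layers
    shape = conv_same(shape, filters)
    trace.append(shape)
    shape = max_pool_2x2(shape)
    trace.append(shape)

flattened = shape[0] * shape[1] * shape[2]  # 2 * 2 * 256
mlp_sizes = (flattened, 256, 97)            # input, hidden, and output neurons

for s in trace:
    print(s)
print("flattened length:", flattened)
```

Running the trace reproduces the Output Shape column of Table 1 down to (2,2,256) and confirms that the flattened vector has 1024 entries.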
3.2. CAE Unsupervised Learning
This paper also considers another CNN model, namely the Convolutional Auto-Encoder
(CAE), for generating the image feature. The CAE is an unsupervised deep learning-based method,
i.e. image labels are not required in the training process. To generate the image feature,
this technique learns and captures information directly from the input data without
class labels.
The CAE involves two parts, i.e. the encoder and decoder blocks. The encoder block
processes the sample data 𝑋, consisting of 𝑛 samples and 𝑚 features, to yield the output 𝑌.
Conversely, the decoder aims to reconstruct the original sample data 𝑋 from 𝑌. Let 𝑋′
be the reconstructed data produced at the decoder side. The main goal of the CAE is to minimize
the difference between the original data 𝑋 and the reconstructed version 𝑋′. Specifically,
the encoder maps the input 𝑋 into a new representation 𝑌 by means of the function 𝑓. This
process can be formulated as follows:
$$Y = f(X) = s_f(WX + b_X) \tag{3}$$
where 𝑠𝑓 denotes the nonlinear activation function on the encoder side. The CAE simply performs
a linear operation if the identity function is used for 𝑠𝑓. The encoder parameters 𝑊 and
𝑏𝑋 ∈ 𝑅^𝑛 are the weight matrix and bias vector, respectively. In contrast, the decoder
reconstructs 𝑋′ from the representation 𝑌 by means of the function 𝑔. This process can be
written as:
$$X' = g(Y) = s_g(W'Y + b_Y) \tag{4}$$
where 𝑠𝑔 represents the activation function on the decoder side. The decoder parameters 𝑊′
and 𝑏𝑌 are the weight matrix and bias vector, respectively.
Strictly speaking, the CAE model searches for the global or near-optimum
parameters θ = (𝑊, 𝑏𝑋, 𝑏𝑌) in the training process. This task is equivalent to minimizing
the loss function over the whole dataset 𝑋 under the following objective function:
$$\theta = \min_{\theta} L(X, X') = \min_{\theta} L\big(X, g(f(X))\big) \tag{5}$$
where 𝐿(∙,∙) denotes the auto-encoder loss function. In this paper, we simply use the 𝐿2
reconstruction loss, commonly referred to as the Mean Squared Error (MSE) [18].
This loss function is formally defined as:
$$L_2(\theta) = \sum_{i=1}^{n} \lVert x_i - x_i' \rVert^2 = \sum_{i=1}^{n} \lVert x_i - g(f(x_i)) \rVert^2 \tag{6}$$
where 𝑥𝑖 ∈ 𝑋, 𝑥𝑖′ ∈ 𝑋′, and 𝑦𝑖 ∈ 𝑌 denote the original input data, the reconstructed data,
and the new compact representation of the input data, respectively.
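To make equations (3), (4), and (6) concrete, the following minimal sketch evaluates the encoder, decoder, and reconstruction loss on toy data in plain Python. It uses small dense weight matrices rather than convolutions, and the toy values of `X`, `W`, `W_prime`, `b_x`, and `b_y` are assumptions chosen for illustration; none of them come from the paper.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def identity(v):
    return list(v)


def encode(x, W, b_x, s_f):
    """Equation (3): y = s_f(W x + b_X)."""
    return s_f(add(matvec(W, x), b_x))


def decode(y, W_prime, b_y, s_g):
    """Equation (4): x' = s_g(W' y + b_Y)."""
    return s_g(add(matvec(W_prime, y), b_y))


def l2_loss(X, W, b_x, W_prime, b_y, s_f, s_g):
    """Equation (6): sum over samples of the squared reconstruction error."""
    total = 0.0
    for x in X:
        x_rec = decode(encode(x, W, b_x, s_f), W_prime, b_y, s_g)
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total


# Toy data: n = 2 samples with m = 4 features, compressed to a 2-value code.
X = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # encoder keeps features 0 and 1
W_prime = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
b_x = [0.0, 0.0]
b_y = [0.0, 0.0, 0.0, 0.0]

linear_loss = l2_loss(X, W, b_x, W_prime, b_y, identity, identity)
sigmoid_loss = l2_loss(X, W, b_x, W_prime, b_y, sigmoid, sigmoid)
print(linear_loss, sigmoid_loss)
```

With identity activations the toy encoder simply drops the last two features, so the loss equals the squared magnitude of the dropped values. Training would adjust the parameters to reduce this loss, which is the minimization stated in equation (5).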
In this paper, the CAE architecture is built with four encoding blocks and four
decoding stages, forming a stacked Convolutional Auto-Encoder.
The CAE architecture used in this paper is summarized in Table 2. Suppose that an
input image is of size 128 × 128 × 3. As can be inferred from Table 2, this image is convolved
four times to obtain a simpler and more compact representation. This process can also be
considered as repetitive encoding. Herein, the new representation is regarded as the neural code
with dimensionality 4 × 4 × 128. Through the reverse decoding process, this
neural code can be recovered to yield a reconstructed image of the original size
128 × 128 × 3. This reverse process performs the deconvolution and unpooling operations.
The CAE neural code can be further utilized as the feature descriptor in the proposed Batik
image retrieval system.
Table 2. The CAE Architecture for Batik Image Retrieval System
| Layer Type | Size | Output Shape |
|---|---|---|
| Input | (128,128,3) | - |
| Convolutional + ReLU | 32 (3x3) filters, 1 stride, 2 padding | (128,128,32) |
| Max Pooling + Dropout | 32 (2x2) filters, 2 stride, 0 padding | (64,64,32) |
| Convolutional + ReLU | 64 (3x3) filters, 1 stride, 2 padding | (64,64,64) |
| Max Pooling + Dropout | 64 (2x2) filters, 2 stride, 0 padding | (32,32,64) |
| Convolutional + ReLU | 64 (3x3) filters, 1 stride, 2 padding | (32,32,64) |
| Max Pooling + Dropout | 64 (2x2) filters, 2 stride, 0 padding | (16,16,64) |
| Convolutional + ReLU | 128 (3x3) filters, 1 stride, 2 padding | (16,16,128) |
| Max Pooling + Dropout | 128 (2x2) filters, 2 stride, 0 padding | (8,8,128) |
| Max Pooling (Neural Code) | 128 (2x2) filters, 1 stride, 2 padding | (4,4,128) |
| Unpooling | 128 (2x2) filters, 2 stride, 0 padding | (8,8,128) |
| Deconvolution + ReLU | 128 (3x3) filters, 1 stride, 2 padding | (8,8,128) |
| Unpooling + Dropout | 128 (2x2) filters, 2 stride, 0 padding | (16,16,128) |
| Deconvolution + ReLU | 64 (3x3) filters, 1 stride, 2 padding | (16,16,64) |
| Unpooling + Dropout | 64 (2x2) filters, 2 stride, 0 padding | (32,32,64) |
| Deconvolution + ReLU | 64 (3x3) filters, 1 stride, 2 padding | (32,32,64) |
| Unpooling + Dropout | 64 (2x2) filters, 2 stride, 0 padding | (64,64,64) |
| Deconvolution + ReLU | 32 (3x3) filters, 1 stride, 2 padding | (64,64,32) |
| Unpooling + Dropout | 32 (2x2) filters, 2 stride, 0 padding | (128,128,32) |
| Deconvolution + Sigmoid | 32 (3x3) filters, 1 stride, 2 padding | (128,128,3) |
3.3. Learning Process and Hyperparameter Tuning
The CNN model is very sensitive to hyperparameter changes in the learning process,
since it uses the Rectified Linear Unit (ReLU), 𝑓(𝑥) = 𝑚𝑎𝑥(0, 𝑥), as its activation function.
Under gradient descent, this function can make training less stable than the tanh
and sigmoid activation functions. In return, ReLU trains faster: a CNN with ReLU is reported
to reach a 25% training error rate about six times faster than an equivalent network with
tanh units [7].
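One way to see why ReLU behaves differently from tanh and sigmoid during gradient descent is to compare their derivatives. The minimal sketch below is an illustration only, with hypothetical function names: for large positive inputs the sigmoid and tanh derivatives shrink toward zero, while the ReLU derivative stays at one.

```python
import math


def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0 by convention here)."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-4.0, 0.5, 4.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.4f}  "
          f"sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

Because the ReLU output is unbounded for positive inputs, large weight updates are not damped by the activation itself, which is consistent with the sensitivity to the learning rate noted above.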
In the training process of our proposed image retrieval system, we split
the image dataset into two subsets, i.e. 75% for training and 25% for testing.
Adaptive Moment Estimation (Adam) [19] is used as the CNN optimizer with a learning rate of
0.0001. We employ the Mean Squared Error (MSE) [20] as the loss function.
To avoid overfitting and to deal with the small size of the dataset, the proposed system
uses data augmentation to increase the data variation. The training and testing
processes are conducted on an Intel Core i5 2010 processor. In our experiment,
the supervised CNN and CAE models require around 10 hours and 3 days, respectively, for
the training process. At the end of the training process, the two deep learning-based models
produce a set of image features that can be used as the descriptor in Batik image retrieval.
These image features are obtained from the last layer of the supervised CNN model and
the neural code layer of the CAE model, respectively.
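The 75%/25% split can be sketched as follows. The paper does not say how the split was drawn, so this example assumes a per-class (stratified) split with a fixed random seed; the function name `split_per_class` is hypothetical. The label list mirrors the dataset of 97 classes with 16 images each described in Section 4.1.

```python
import random


def split_per_class(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train and test sets, class by class.

    Assumption: the split is stratified, so every class contributes the
    same fraction of its images to the training set.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# 97 classes with 16 images each, i.e. 1552 images in total.
labels = [c for c in range(97) for _ in range(16)]
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

Under this assumption each class contributes 12 training and 4 test images.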
4. Experimental Study
Extensive experiments were carried out to examine the performance of the proposed
method in the Batik image retrieval system. First, we give a brief description
of the image dataset used in the experiment. The effectiveness of the proposed method is
then observed under visual investigation. Finally, objective performance
comparisons are conducted to examine the effect of different distance metrics and
the superiority of the proposed method over competing schemes.
4.1. Dataset
This experiment uses a set of Batik images, referred to as the Batik image dataset, covering
various patterns, colors, and motifs. This image database consists of 1552 images
divided into 97 image classes. Each class contains a set of similar images
with respect to their motifs and content appearance. Each image class contains 16 images, and
all images belonging to the same class are considered similar. Figure 2 gives
several examples of Batik images from the dataset.
4.2. Practical Application on Batik Image Retrieval
This sub-section evaluates the performance of the proposed method under visual
investigation. The proposed method uses the image features obtained from the CNN and CAE
approaches to perform Batik image retrieval. The correctness of the proposed method
is determined by whether the system returns a set of correctly retrieved images.
Figure 3 displays the images returned by the proposed image retrieval system
using the CNN and CAE image features. We show only sixteen retrieved images, arranged in
ascending order of their distance score. The similarity criterion is measured using the
distance score, which is given at the top of each image. A smaller distance value indicates
greater similarity between the query and the target image in the database. As shown in this
figure, the proposed method with the CNN feature returns all retrieved images correctly,
whereas the proposed method with the CAE feature returns only six correct images.
Figure 2. Some image samples in the Batik dataset
Figure 3. Performance evaluation in terms of visual investigation for the proposed method with: (a) CNN, and (b) CAE image feature
4.3. Comparison of Proposed Methods with Different Distance Metrics
This sub-section reports the effect of different distance metrics on the proposed method.
In this experiment, three distance metrics, namely the Euclidean [21], Manhattan [22],
and Bray-Curtis [23] distances, are examined under two performance criteria,
i.e. the precision and recall rates. These two scores are formally defined as:
$$p_i(n) = \frac{R_V}{n} \tag{7}$$
$$r_i(n) = \frac{R_V}{M} \tag{8}$$
where 𝑝𝑖(𝑛) and 𝑟𝑖(𝑛) denote the precision and recall rates, respectively, when image 𝑖 is
used as the query image. The symbols 𝑛 and 𝑀 represent the number of retrieved images and
the total number of images in the database that are relevant to image 𝑖, respectively. 𝑅𝑉 is
the number of images relevant to query image 𝑖 among the 𝑛 retrieved images.
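Equations (7) and (8) translate directly into code. The sketch below is a minimal illustration with a hypothetical function name and made-up class labels "A", "B", and "C"; it assumes the database images have already been sorted by ascending distance to the query.

```python
def precision_recall_at_n(ranked_labels, query_label, n, num_relevant):
    """Return (precision, recall) for the first n retrieved images.

    ranked_labels: class labels of database images, nearest first.
    num_relevant:  M, the number of database images relevant to the query.
    """
    # R_V: number of relevant images among the first n retrieved ones.
    r_v = sum(1 for label in ranked_labels[:n] if label == query_label)
    return r_v / n, r_v / num_relevant


# Toy ranking for a query of class "A" with M = 5 relevant images in total.
ranked = ["A", "A", "B", "A", "C", "A", "B", "A"]
p4, r4 = precision_recall_at_n(ranked, "A", n=4, num_relevant=5)
print(p4, r4)
```

In this toy ranking three of the first four retrieved images are relevant, so the precision is 3/4 and the recall is 3/5.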
Figure 4 shows the performance comparison over various distance metrics in terms of
precision and recall scores. All images in the database are used in turn as the query image.
The number of retrieved images is set as 𝑛 = {1, 2, … , 16}. In most cases, the Bray-Curtis
distance yields the best retrieval performance among the distance metrics for both the CNN and
CAE image features. In the Batik image retrieval system, the Bray-Curtis distance is therefore
a good candidate for measuring the similarity between the query and target images in the database.
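For reference, the three distance metrics and the nearest-neighbor ranking step can be sketched in plain Python. The Bray-Curtis form used here, the sum of absolute differences divided by the sum of absolute sums, is the commonly used definition and is assumed to match the one in [23]. The function names and the toy three-dimensional feature vectors are illustrative and are not taken from the paper.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def bray_curtis(u, v):
    """Sum of |u_i - v_i| divided by sum of |u_i + v_i| (0 if both are zero)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def retrieve(query, database, distance, n):
    """Return indices of the n database features closest to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: distance(query, database[i]))
    return order[:n]


# Toy feature vectors; the first two are meant to belong to the same class.
database = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
query = [0.9, 0.1, 0.0]

for metric in (euclidean, manhattan, bray_curtis):
    print(metric.__name__, retrieve(query, database, metric, n=2))
```

On this toy data all three metrics return the same two nearest neighbors; on real features the rankings can differ, which is what the comparison in Figure 4 measures.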
Table 3 gives a more complete comparison of the proposed image retrieval system
using the CNN and CAE features over the various distance metrics. This comparison is evaluated
in terms of the average recall rate with the number of retrieved images set to 𝑛 = 16. Herein,
all images in the database are used in turn as the query image. As reported in this table,
the proposed method with the supervised CNN delivers better performance than the CAE technique.
The image feature obtained from the proposed supervised CNN method is more suitable for
the Batik image retrieval task.
Figure 4. Performance comparisons in terms of precision and recall rates over various distance metrics with the image features from: (a) CNN, and (b) CAE method
4.4. Comparison against Former Methods
This sub-section summarizes the performance comparison between the proposed
supervised CNN method and existing schemes for Batik image retrieval. This
comparison is conducted in terms of the Average Precision Recall (APR) score. The APR is
formally defined as:
$$\mathrm{APR} = \frac{1}{N} \sum_{i=1}^{N} r_i(n) \tag{9}$$
where 𝑟𝑖(𝑛) and 𝑁 are the recall rate for query image 𝑖 and the total number of images in
the database, respectively. Herein, all images in the database are used in turn as the query
image, so that 𝑁 = 1552. Thus, the APR value is averaged over all query images. The number of
retrieved images is set to 𝑛 = 16. To make a fair comparison, this experiment also
investigates the dimensionality of the image feature.
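The APR computation in equation (9) can be sketched as a loop over all query images. This is a minimal illustration under two assumptions that the paper does not state explicitly: the query image stays in the database while it is being searched, and images of the same class as the query are the relevant ones. The function names and the toy one-dimensional features are hypothetical.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def average_recall(features, labels, distance, n):
    """Equation (9): mean recall at n, using every image in turn as the query."""
    total = 0.0
    for f_q, label_q in zip(features, labels):
        # Rank all database images by ascending distance to the query.
        order = sorted(range(len(features)),
                       key=lambda i: distance(f_q, features[i]))
        relevant_found = sum(1 for i in order[:n] if labels[i] == label_q)
        total += relevant_found / labels.count(label_q)   # r_i(n) = R_V / M
    return total / len(features)


# Toy database: two classes with three images each; the last image of
# class 1 lies close to class 0 and therefore causes retrieval errors.
features = [[0], [1], [2], [10], [11], [3]]
labels = [0, 0, 0, 1, 1, 1]

apr = average_recall(features, labels, manhattan, n=3)
print(round(apr, 4))
```

In the paper's setting the same loop would run over N = 1552 feature vectors with n = 16, so a perfect retrieval of every 16-image class would give an APR of 1.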
Table 4 reports the performance comparison in terms of feature dimensionality and APR
value. As shown in this table, the proposed supervised CNN yields the best performance in
comparison with the other competing schemes. It is noteworthy that the proposed method
requires the lowest feature dimensionality, with the exception of the LBP [24] scheme.
A lower dimensionality means a faster KNN search, which makes the Batik
image retrieval system more efficient. Thus, the proposed method can be considered for
implementing Batik image retrieval and classification systems.
Table 3. APR of the CNN and CAE Features
| Method | Euclidean | Manhattan | Bray-Curtis |
|---|---|---|---|
| CNN | 0.9938 | 0.9931 | 0.9947 |
| CAE | 0.6737 | 0.6387 | 0.7654 |

Table 4. APR Comparison with Former Methods
| Method | Feature Size | APR (%) |
|---|---|---|
| LBP [24] | 59 | 92.57 |
| LTP [25] | 118 | 95.65 |
| CLBP [26] | 118 | 95.17 |
| LDP [27] | 236 | 93.52 |
| Gabor Filter [28] | 144 | 96.55 |
| ODBTC+PSO [3] | 384 | 97.68 |
| Proposed Supervised CNN | 97 | 99.47 |
5. Conclusions
A new content-based image retrieval system has been presented in this paper. This
system achieves retrieval accuracies of 99.47% and 76.54% on the Batik image database when
the image feature is constructed from the CNN and CAE deep learning-based architectures,
respectively. The CNN outperforms the existing schemes in terms of retrieval accuracy.
In addition, it requires a feature dimensionality of only 97, which is lower than that of all
competing methods except LBP. For future work, the CAE model can be slightly modified by
adding fully connected layers before and after the neural code section. This modification may
reduce the dimensionality of the image feature and, at the same time, improve the performance of
Batik image retrieval.
References
[1] Johnson MH. The neural basis of cognitive development. In: Damon W. Editor. Handbook of
child psychology: Cognition, perception, and language. Hoboken: John Wiley & Sons Inc. 1998: 1-49.
[2] Liu P, et al. Fusion of deep learning and compressed domain features for content-based image
retrieval. IEEE Transactions on Image Processing. 2017; 26(12): 5706-5717.
[3] Prasetyo H, et al. Batik Image Retrieval Using ODBTC Feature and Particle Swarm Optimization.
Journal of Telecommunication, Electronic Computer Engineering. 2018; 10(2-4): 71-74.
[4] Datta R, Li J, Wang JZ. Content-based image retrieval: approaches and trends of the new age.
Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval. 2005.
[5] Eakins JP, Graham ME. Content based image retrieval: A report to the JISC technology applications
programme. 1999.
[6] Russakovsky O, et al. Imagenet large scale visual recognition challenge. International Journal of
Computer Vision. 2015; 115(3): 211-252.
[7] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems. 2012: 1097-1105.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv
preprint arXiv. 2014.
[9] Szegedy C, et al. Going deeper with convolutions. Proceedings of the IEEE conference on computer
vision and pattern recognition. 2015: 1-9.
[10] He K, et al. Deep residual learning for image recognition. Proceedings of the IEEE conference on
computer vision and pattern recognition. 2016: 770-778.
[11] Cover T, Hart P. Nearest neighbor pattern classification. IEEE transactions on information theory.
1967; 13(1): 21-27.
[12] Petscharnig S, Lux M, Chatzichristofis S. Dimensionality reduction for image features using deep
learning and autoencoders. Proceedings of the 15th International Workshop on Content-Based
Multimedia Indexing, ACM. 2017.
[13] Masci J, et al. Stacked convolutional auto-encoders for hierarchical feature extraction. International
Conference on Artificial Neural Networks. 2011: 52-59.
[14] Wang R, et al. A Crop Pests Image Classification Algorithm Based on Deep Convolutional
Neural Network. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017;
15(3): 1239-1246.
[15] Baharin A, Abdullah A, Yousoff SNM. Prediction of Bioprocess Production Using Deep Neural
Network Method. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017;
15(2): 805-813.
[16] Sudiatmika IBK, Rahman F, Trisno T, Suyoto S. Image forgery detection using error level analysis
and deep learning. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019;
17(2): 653-659.
[17] Setiawan W, Utoyo MI, Rulaningtyas R. Classification of neovascularization using convolutional
neural network model. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019;
17(1): 463-472.
[18] Meng Q, et al. Relational autoencoder for feature extraction. 2017 International Joint Conference on
Neural Networks (IJCNN). 2017: 364-371.
[19] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv. 2014.
[20] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE
transactions on Neural Networks. 1994; 5(6): 989-993.
[21] Danielsson PE. Euclidean distance mapping. Computer Graphics image processing. 1980;
14(3): 227-248.
[22] Craw S. Manhattan distance. In: Sammut C, Webb GI. Encyclopedia of Machine Learning and Data
Mining. Springer. 2017: 790-791.
[23] Kokare M, Chatterji B, Biswas P. Comparison of similarity metrics for texture image retrieval.
TENCON 2003. IEEE, Conference on Convergent Technologies for the Asia-Pacific Region. 2003; 2:
571-575.
[24] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture
classification with local binary patterns. IEEE Transactions on pattern analysis machine intelligence.
2002; 24(7): 971-987.
[25] Tan X, Triggs B. Enhanced local texture feature sets for face recognition under difficult lighting
conditions. IEEE transactions on image processing. 2010; 19(6): 1635-1650.
[26] Guo Z, Zhang L, Zhang D. A completed modeling of local binary pattern operator for texture
classification. IEEE Transactions on Image Processing. 2010; 19(6): 1657-1663.
[27] Zhang B, et al. Local derivative pattern versus local binary pattern: face recognition with high-order
local pattern descriptor. IEEE transactions on image processing. 2010; 19(2): 533-544.
[28] Prasetyo H, Wiranto W, Winarno W. Statistical Modeling of Gabor Filtered Magnitude for Batik Image
Retrieval. Journal of Telecommunication, Electronic Computer Engineering. 2018; 10(2-4): 85-89.

Batik image retrieval using convolutional neural network

  • 1. TELKOMNIKA, Vol.17, No.6, December 2019, pp.3010~3018 ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018 DOI: 10.12928/TELKOMNIKA.v17i6.12701 ◼ 3010 Received March 18, 2019; Revised July 2, 2019; Accepted July 18, 2019 Batik image retrieval using convolutional neural network Heri Prasetyo*1 , Berton Arie Putra Akardihas2 Department of Informatics, Universitas Sebelas Maret (UNS), Surakarta, Indonesia *Corresponding author, e-mail: heri.prasetyo@staff.uns.ac.id1 , bertonarie123@student.uns.ac.id2 Abstract This paper presents a simple technique for performing Batik image retrieval using the Convolutional Neural Network (CNN) approach. Two CNN models, i.e. supervised and unsupervised learning approach, are considered to perform end-to-end feature extraction in order to describe the content of Batik image. The distance metrics measure the similarity between the query and target images in database based on the feature generated from CNN architecture. As reported in the experimental section, the proposed supervised CNN model achieves better performance compared to unsupervised CNN in the Batik image retrieval system. In addition, image feature composed from the proposed CNN model yields better performance compared to that of the handcrafted feature descriptor. Yet, it demonstrates the superiority performance of deep learning-based approach in the Batik image retrieval system. Keywords: autoencoder, CNN, deep learning, feature extraction, image retrieval Copyright © 2019 Universitas Ahmad Dahlan. All rights reserved. 1. Introduction The brain is an amazing organ in the human body. With our brains, we can understand what we see, smell, taste, hear and touch. The infant brain weight is only about half a kilogram but can solve a big problem, and even supercomputers cannot. After several months of birth, the baby can recognize the face of his parents, discern discrete objects from the background, and begin to speak. 
Within one year the baby has an intuition about natural objects, can follow objects and understand the meaning of a sound. When they are children, they can understand grammar and have thousands of words in their vocabulary. Building machines that have intelligence like our brains are not easy, to make machines with artificial intelligence we have to solve very complex computing problems that we have even struggled with, problems that our brains can solve in a matter of seconds. To overcome this problem, we have to develop other ways to program computers that have been used in this decade. Therefore there arises an active field of artificial computer intelligence and also commonly called deep learning [1]. Nowadays Artificial intelligence has undergone very rapid development. Ai has been used in many fields of research, in the field of computer vision Content-Based Image Retrieval (CBIR) has been developed in multi-level schemes with low-level features to high-level features. Convolutional Neural Network (CNN) has been successfully used to be an effective descriptor feature and gain accurate results. In general, the features gain by the deep learning method are trained by mimic human perceptions through various operations such as convolution and pooling. Deep learning has become a descriptor feature that is better than low-level features. Although now the CNN module has become state of the art in computer vision this does not guarantee the features obtained from the highest level always get the best performance [2]. In the Content-Based Image Retrieval system aims to provide the right way to do the browsing, retrieving and searching some desired images that have been stored in the image database. The image database contains many images that have been stored and arranged in a storage device. 
Usually, the size of the image database is very large so that the process of searching for specific images manually requires a lot of time, and causes conditions that are uncomfortable for the user. For example, Batik is a cultural heritage of the archipelago Indonesia that has a high value and blend of art, laden with philosophical meanings and meaningful symbols that show the way of thinking of the people making it. Batik is a craft that has been a part of Indonesian culture especially Javanese for a long time, batik have
  • 2. TELKOMNIKA ISSN: 1693-6930 ◼ Batik image retrieval using convolutional neural network (Heri Prasetyo) 3011 a lot of motives, pattern and color so to take specific batik picture from the database very challenging [3]. This paper offers a solution to use convolutional neural networks to carry out CBIR tasks to solve problems that occur in taking batik images. The method intended is to produce effective image descriptors from the CNN architecture. Descriptors of this feature are very important for content-based shooting systems. The Image feature is used to improve the performance and to solve problems in existing batik shooting systems. 2. Content-based Image Retrieval System Image retrieval is a computer system for searching and retrieving a specific image in large or big size of image databases. The classical approach appends on the metadata such as texts, keywords, or descriptions embedded in an image. Thus, the image retrieval can be performed with the search key as aforementioned text, keywords, etc. This technique is inefficient since the manual image annotations are time-consuming and exhausting process. Even though, large amounts of automatic images annotations have been proposed in literature [4], an image retrieval system with content annotation still cannot deliver satisfactory result. CBIR is computer application dealing with the searching problems over large-scale image database. CBIR, also recognized as Query-Based Image Content (QBIC) and Content-Based Visual Information Search (CBVIR), differs with the content-based approach. The CBIR analyzes the image content rather than metadata information such image keywords, tags, or image descriptions [5]. In this paper, the usability of CNN model is extended to the CBIR task. The main reason is the superiority performance offered by CNN model compared to the handcrafted feature in the computer vision and recognition tasks. 
The CNN or Deep Learn network achieves the outstanding retrieval performance in the ImageNet challenge [6]. The CNN model inspires the other deep learning-based approaches, such as AlexNet [7], VGGNet [8], GoogleLeNet [9], Microsoft ResNet [10], etc., to tackle the obsolete of handcrafted feature in the image retrieval domain. The CNN model receives a three-dimensional image of size ℎ × 𝑤 × 𝑑, where ℎ and 𝑤 are spatial dimensions and 𝑑 is the number of channels. This image is further processed thorough the CNN architecture consisting several convolutions, max-poolings, and activation functions to perform end-to-end image feature generation. Let 𝑋𝑖𝑗 be a vector data located at spatial position (𝑖, 𝑗) in specific layers. The CNN computes a new data 𝑌𝑖𝑗 as follow: Yij = fks({Xsi+δi,sj+δj} 0≤δi,δj<k) (1) where 𝑘 and 𝑠 denote kernel size and stride, respectively. The function 𝑓𝑘𝑠 is the layer type used such as matrix dot multiplication for convolutional layers, max spatial for max pooling layers, nonlinear functions for activation functions, and other types of layers. This form of functionality is maintained using kernel size and step composition while still using the transformation rules. fks°gk′s′ = (f°g)k′+(k−1)s′,ss′ (2) While a general network computes general nonlinear functions, a network with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. FCN naturally operates at any size input and produces the appropriate spatial dimensions. The loss function is valued composed with the FCN defines task. If the loss function is a sum over the spatial dimensions of the final layer 𝑙(𝑥; 𝜃) = ∑ 𝑙′ 𝑖𝑗 (𝑥𝑖𝑗; 𝜃), the parameter gradient will be a sum over the parameter gradients of each of its spatial components. Thus stochastic gradient on 𝑙 computed on whole images will be the same as the stochastic gradient on 𝑙′, taking all the final receptive fields as minibatch. 
When calculating this receptive field is done repeatedly with forward and backward propagation operations feedback will be more effective if the calculation is done layer by layer in all images compared to computing patch by patch to the part of the image. An illustration of a CNN operation can be seen in Figure 1.
  • 3. ◼ ISSN: 1693-6930 TELKOMNIKA Vol. 17, No. 6, December 2019: 3010-3018 3012 The proposed CNN model constructs the feature descriptor from Batik image. This feature descriptor is to measure the similarity between query and target images in database under the K-Nearest Neighbors (KNN) [11] strategy. This KNN technique performs similarity matching with the distance score criterion. This paper investigates two CNN models in the training stage, i.e. with supervised and unsupervised learning approaches. Figure 1 illustrates an example of proposed supervised CNN architecture for Batik image retrieval. The supervised terminology refers to the utilization of class label, whereas unsupervised disobeys the image label in the training process. Autoencoder is simple example of unsupervised CNN method which compresses the data features into smaller size and recovers back to the original data [12]. Figure 1. Ilustration operation using CNN 3. Method This section presents two methods for generating the feature descriptor in the Batik image retrieval system. We firstly explain the supervised CNN model. Then, the unsupervised CAE model [13] is subsequently described in this section. 3.1. Supervised Learning The CNN model is the supervised deep learning-based approach commonly used in the image classification [14], prediction [15], segmentation, analysis [16], etc. The supervised CNN model consists of several layers such as convolutional layer, max pooling layer, etc. These layers are repeated over several times and fed into the fully connected layer at the end of CNN layer [17]. Our proposed image retrieval system employs the CNN architecture with six convolutional layers and two fully connected layers to generate Batik feature descriptor. Table 1 summarizes the CNN architecture used in our proposed method. Table 1. 
The Supervised CNN Architecture for Batik Image Retrieval Layer Type Size Output Shape Input (128,128,3) - Convolutional + Relu 8 (3x3) filters, 1 stride, 2 padding (128,128,8) Max Pooling 8 (2x2) filters, 2 stride, 0 padding (64,64,8) Convolutional + Relu 16 (3x3) filters, 1 stride, 2 padding (64,64,16) Max Pooling 16 (2x2) filters, 2 stride, 0 padding (32,32,16) Convolutional + Relu 32 (3x3) filters, 1 stride, 2 padding (32,32,32) Max Pooling 32 (2x2) filters, 2 stride, 0 padding (16,16,32) Convolutional + Relu 64 (3x3) filters, 1 stride, 2 padding (16,16,64) Max Pooling 64 (2x2) filters, 2 stride, 0 padding (8,8,64) Convolutional + Relu 128 (3x3) filters, 1 stride, 2 padding (8,8,128) Max Pooling 128 (2x2) filters, 2 stride, 0 padding (4,4,128) Convolutional + Relu 256 (3x3) filters, 1 stride, 2 padding (4,4,256) Max Pooling 256 (2x2) filters, 2 stride, 0 padding (2,2,256) Flatern + Dropout (30%) (1,1,1024) 1024 neurons 256 Dense 256 neurons 97 Softmax 97 way 97
  • 4. TELKOMNIKA ISSN: 1693-6930 ◼ Batik image retrieval using convolutional neural network (Heri Prasetyo) 3013 After performing six convolution and max-pooling operations, an input image of size 128 × 128 × 3 is converted into new representation with dimensionality 2 × 2 × 256. This new data representation is then flatten to become one dimensional data of size 1 × 1 × 1024. This flatten data is subsequently processed and trained with the Multi-Layer Perceptron (MLP). Herein, the MLP receives 1024 input feature and feeds into 1024 input neurons. The hidden and output layers are set as 256 and 97, respectively. The value of 97 in output layers is equivalent to that of the desired class target, i.e. the number of Batik image classes used in the proposed image retrieval system. 3.2. CAE Unsupervised Learning This paper also considers the other CNN model, namely Convolutional Auto-Encoder (CAE), for generating image feature. The CAE is an unsupervised deep learning-based method, i.e. the image label is not required in the training process. In order to generate image feature, this technique learns and captures the information from input data directly without the availability of class label. The CAE involves two parts, i.e. encoder and decoder blocks. The encoder block processes the sample data 𝑋 consisting 𝑛 samples and 𝑚 features to yield the output 𝑌. In the opposite side, the decoder aims to reconstruct the original sample data 𝑋 from the 𝑌. Let 𝑋′ be the reconstructed data produced at the decoder side. The main goal of CAE is to minimize the difference between the original data 𝑋 and reconstructed version 𝑋′. Specifically, the encoder simply maps the input 𝑋 into new representation 𝑌 with the help of function 𝑓. This process can be formulated as follow: Y = f(X) = sf (WX + bX) (3) where 𝑠𝑓 denotes the nonlinear activation function in encoder side. CAE simply performs a linear operation if one simply uses identity function for 𝑠𝑓. 
The 𝑊 and 𝑏 𝑋 ∈ 𝑅 𝑛 are encoder parameters, respectively, referring as weight matrix and bias vector. In contrast, the decoder reconstructs 𝑋′ from 𝑌 representation by means of function 𝑔. This process can be simply illustrated as: X′ = g(Y) = sg (W′ Y + bY) (4) where 𝑠𝑔 represents the activation function in decoder side. The 𝑏 𝑌 and 𝑊 are the bias vector and weight matrix, respectively, denotingas decoder parameter. Strictly speaking, the CAE model searches the global or near optimum parameter = (𝑊, 𝑏 𝑋, 𝑏 𝑌) in the training process. This task is equivalent to the minimization process of loss function over all dataset 𝑋 under the following objective function: θ = min θ L(X, X′) = min θ L(X, g(f(X))) (5) where 𝐿(∙,∙) denotes the auto-encoder loss function. In this paper, we simply use linear reconstruction 𝐿2 for loss function, or commonly referred as Mean Squared Error (MSE) [18]. This loss function is formally defined as: L2(θ) = ∑‖xi − xi ′ ‖2 n i=1 = ∑‖xi − g(f(xi))‖ 2 n i=1 (6) where 𝑥𝑖 ∈ 𝑋, 𝑥𝑖 ′ ∈ 𝑋′ and 𝑦𝑖 ∈ 𝑌, respectively denote the original input data, reconstructed data, and new compact representation of input data. In this paper, the CAE architecture was built with four encoding blocks and four decoding stages. This architecture includes a stacked Convolutional Auto-Encoder. The summary of CAE architecture used in this paper can be seen in Table 2. Suppose that an input image is of size 128 × 128 × 3. As it can be inferred from Table 2, this image is convolved four times to obtain new simpler and compact representation. This process can be also considered as repetitive encoding. Herein, the new representation is regarded as neural code
  • 5. ◼ ISSN: 1693-6930 TELKOMNIKA Vol. 17, No. 6, December 2019: 3010-3018 3014 with dimensionality 4 × 4 × 128. By using the backward approach and decoding process, this neural code can be recovered back to yield the reconstructed image of original size 128 × 128 × 3. This reverse process performs the deconvolution and unpooling operations. The CAE neural code can be further utilized as the feature descriptor in the proposed Batik image retrieval system. Table 2. The CAE Architecture for Batik Image Retrieval System Layer Type Size Output Shape Input (128,128,3) - Convolutional + Relu 32 (3x3) filters, 1 stride, 2 padding (128,128,32) Max Pooling + Dropout 32 (2x2) filters, 2 stride, 0 padding (64,64,32) Convolutional + Relu 64 (3x3) filters, 1 stride, 2 padding (64,64,64) Max Pooling + Dropout 64 (2x2) filters, 2 stride, 0 padding (32,32,64) Convolutional + Relu 64 (3x3) filters, 1 stride, 2 padding (32,32,64) Max Pooling + Dropout 64 (2x2) filters, 2 stride, 0 padding (16,16,64) Convolutional + Relu 128 (3x3) filters, 1 stride, 2 padding (16,16,128) Max Pooling + Dropout 128 (2x2) filters, 2 stride, 0 padding (8,8,128) Max Pooling (Neural Code) 128 (2x2) filters, 1 stride, 2 padding (4,4,128) Unpoling 128 (2x2) filters, 2 stride, 0 padding (8,8,128) Deconvovution + Relu 128 (3x3) filters, 1 stride, 2 padding (8,8,128) Unpooling + Dropout 128 (2x2) filters, 2 stride, 0 padding (16,16,128) Deconvovution + Relu 64 (3x3) filters, 1 stride, 2 padding (16,16,64) Unpooling + Dropout 64 (2x2) filters, 2 stride, 0 padding (32,32,64) Deconvovution + Relu 64 (3x3) filters, 1 stride, 2 padding (32,32,64) Unpooling + Dropout 64 (2x2) filters, 2 stride, 0 padding (64,64,64) Deconvovution + Relu 32 (3x3) filters, 1 stride, 2 padding (64,64,32) Unpooling + Dropout 32 (2x2) filters, 2 stride, 0 padding (128,128,32) Deconvovution + Sigmoid 32 (3x3) filters, 1 stride, 2 padding (128,128,3) 3.3. 
Learning process and Hyperparemeter Tuning The CNN model is very sensitive to hyperparameter changes in the learning process, since it utilizes the Restructured Linear Unit (ReLu) 𝑓(𝑥) = 𝑚𝑎𝑥(0, 𝑥) for its activation function. This function is with the gradient descent making it very unstable in comparison with the tanh and sigmoid activation functions. Compared to the aforementioned activation functions, ReLu yields an identical error with 25% less iteration in learning stage [7]. In the training process of our proposed image retrieval system, we simply split the image dataset as two folds, i.e. 75% and 25% for training and testing purpose, respectively. The Adaptive Moment Estimation (Adam) [19] is exploited for CNN optimizer with learning rate 0.0001. We simply employ the Mean Square Error (MSE) [20] for calculating the loss function. For avoiding the overfitting problem and dealing with small size of dataset, the proposed system uses data augmentation technique to improve the data variation. The training and testing processes are conducted under the Intel Core i5 2010 processor. From our experiment, the supervised CNN and CAE models require around 10 hours and 3 days, respectively, for the training process. At the end of training process, two deep learning based models produce a set of image features which can be used for the descriptor in the Batik image retrieval. These image features are simply obtained from the last layer and neural code layer of supervised CNN and CAE models, respectively. 4. Experimental Study Extensive experiments were carried out to investigate and examine the proposed method performance in the Batik image retrieval system. Firstly, we give a brief description about the image dataset used in the experiment. The effectiveness of the proposed method is subsequently observed under visual investigation. 
Then, the objective performance comparisons are further evaluated to overlook the effect of different distance metrics and superiority of the proposed method in comparison with the former competing schemes. 4.1. Dataset This experiment utilizes a set of Batik images, refered as Batik image dataset, over various patterns, colors, and motifs. This image database consists of 1552 image. This database is further divided into 97 image classes. Each class contains a set of similar images
  • 6. TELKOMNIKA ISSN: 1693-6930 ◼ Batik image retrieval using convolutional neural network (Heri Prasetyo) 3015 regarding to their motifs and content appearance. Each image class owns 16 similar images, in which all images belonging to the same class are considered as similar images. Figure 2 gives several examples of Batik images from the dataset. 4.2. Practical Application on Batik Image Retrieval This sub-section evaluates the performance of the proposed method under visual investigation. The proposed method utilizes the image feature obtained from CNN and CAE approach for performing Batik image retrieval system. The correctness of the proposed method is determined whether the system returns a set of retrieved images correctly or not. Figure 3 displays the retrieved images returned by the proposed image retrieval system using the CNN and CAE image features. We only show six-teen retrieved images arranged in ascending manner based on their similarity score. The similarity criterion is measured using the distance score and given at the top of each image. Smaller distance value indicates more similar between the query and target image in database. As shown in this figure, the proposed method with CNN feature returns all retrieved images correctly. It is little regrettable that the proposed method with CAE feature only produces six retrieved images correctly. Figure 2. Some image samples in the Batik dataset (a) (b) Figure 3. Performance evaluation in terms of visual investigation for the proposed method with: (a) CNN, and (b) CAE image feature
  • 7. ◼ ISSN: 1693-6930 TELKOMNIKA Vol. 17, No. 6, December 2019: 3010-3018 3016 4.3. Comparison of Porposed Methods with Direfferent Distance Metrics This sub-section reports the effect of different distance metrics on the proposed method. In this experiment, three distance metrics, namely Euclidean [21], Manhattan [22], and Bray-Curtis distance [23], are extensively examined over two performance criterion, i.e. precision and recall rate. These two scores are formally defined as: pi(n) = RV n (7) ri(n) = RV M (8) where 𝑝𝑖(𝑛) and 𝑟𝑖(𝑛) denotes the precision and recall rate, respectively, if image 𝑖 is turned as query image. The symbols 𝑛 and 𝑀 represent the number of retrieved images and total images in database which is relevant to image 𝑖, respectively. 𝑅 𝑉 is the number of images which are relevant to query image 𝑖 obtained at 𝑛 retrieved images. Figure 4 shows the performance comparison over various distance metrics in terms of Precision and Recall scores. All images in database are chosen as query image. The number of retrieved images are set as 𝑛 = {1,2, … ,16}. In most cases, Bray-Curtis distance yields the best retrieval performance compared to that of the other distance metrics for both CNN and CAE image feature. In the Batik image retrieval system, the Bray-Curtis distance becomes a good candidate for measuring the similarity between the query and target images in database. Table 3 tabulates more complete comparsions for the proposed image retrieval system using CNN and CAE features over various distance. This comparison is evaluated in terms of average recall rate with the number of retrieved images as 𝑛 = 16. Herein, all images in database are turned as query image. As reported in this table, the proposed method with supervised CNN delivers better performance compared to that of CAE technique. The image feature obtained from proposed supervised CNN method is more suitable for Batik image retrieval task. (a) (b) Figure 4. 
Performance comparisons in terms of precision and recall rates over various distance metrics with the image features from: (a) CNN, and (b) CAE method 4.4. Comparison against Former Methods This sub-section summarizes the performance comparison between the proposed supervised CNN method and former existing schemes on Batik image retrieval system. This comparison is conducted in terms of Average Precision Recall (APR) score. The APR is formally defined as:
  • 8. TELKOMNIKA ISSN: 1693-6930 ◼ Batik image retrieval using convolutional neural network (Heri Prasetyo) 3017 𝐴𝑃𝑅 = 1 𝑁 ∑ 𝑟𝑖(𝑛) 𝑁 𝑖=1 (9) where 𝑟𝑖(𝑛) and 𝑁 are the recall rate for query image 𝑖 and the total number of images in database, respectively. Herein, all images in database are turned as query image indicating that 𝑁 = 1552. Thus, the APR value is averaging over all query images. The number of retrieved images is set as 16 yielding 𝑛 = 16. To make a fair comparison, this experiment also investigates the dimensionality of image feature. Table 4 reports the performance comparison in terms of feature dimensionality and APR value. As shown in this table, the proposed supervised CNN yields the best performance in comparison with the other competing schemes. It is noteworthy that the proposed method requires lowest feature dimensionality (with exceptional on comparison to LBP [20] scheme). This lower dimensionality indicates the faster process on KNN searching for effective Batik image retrieval system. Thus, the proposed method can be considered on implementing the Batik image retrieval and classification system. Table 3. APR CNN and CAE Method Euclidean Manhattan Bray-curtis CNN 0.9938 0.9931 0.9947 CAE 0.6737 0.6387 0.7654 Table 4. APR Comparison with Former Method Method Feature Size APR (%) LBP [24] 59 92.57 LTP [25] 118 95.65 CLBP [26] 118 95.17 LDP [27] 236 93.52 Gabor Filter [28] 144 96.55 ODBTC+PSO [3] 384 97.68 Proposed Supervised CNN 97 99.47 5. Conclusions A new content-based image retrieval system has been presented in this paper. This system achieves the retrieval accuracies 99.47% and 76.54%, respectively, while the image feature is constructed from CNN and CAE deep learning-based architecture on Batik image database. The CNN outperforms the former existing schemes in terms of retrieval accuracy. In addition, it requires the lowest image features, i.e. 97 feature dimensionality, compared to other methods. 
For future work, the CAE model can be slightly modified by adding fully-connected layers before and after the neural code section. This modification may reduce the dimensionality of the image feature while also improving Batik image retrieval performance.

References
[1] Johnson MH. The neural basis of cognitive development. In: Damon W. Editor. Handbook of child psychology: Cognition, perception, and language. Hoboken: John Wiley & Sons Inc. 1998: 1-49.
[2] Liu P, et al. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE Transactions on Image Processing. 2017; 26(12): 5706-5717.
[3] Prasetyo H, et al. Batik Image Retrieval Using ODBTC Feature and Particle Swarm Optimization. Journal of Telecommunication, Electronic and Computer Engineering. 2018; 10(2-4): 71-74.
[4] Datta R, Li J, Wang JZ. Content-based image retrieval: approaches and trends of the new age. Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval. 2005.
[5] Eakins JP, Graham ME. Content based image retrieval: A report to the JISC technology applications programme. 1999.
[6] Russakovsky O, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015; 115(3): 211-252.
[7] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012: 1097-1105.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv. 2014.
[9] Szegedy C, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1-9.
[10] He K, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[11] Cover T, Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967; 13(1): 21-27.
[12] Petscharnig S, Lux M, Chatzichristofis S. Dimensionality reduction for image features using deep learning and autoencoders. Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, ACM. 2017.
[13] Masci J, et al. Stacked convolutional auto-encoders for hierarchical feature extraction. International Conference on Artificial Neural Networks. 2011: 52-59.
[14] Wang R, et al. A Crop Pests Image Classification Algorithm Based on Deep Convolutional Neural Network. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017; 15(3): 1239-1246.
[15] Baharin A, Abdullah A, Yousoff SNM. Prediction of Bioprocess Production Using Deep Neural Network Method. TELKOMNIKA Telecommunication Computing Electronics and Control. 2017; 15(2): 805-813.
[16] Sudiatmika IBK, Rahman F, Trisno T, Suyoto S. Image forgery detection using error level analysis and deep learning. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019; 17(2): 653-659.
[17] Setiawan W, Utoyo MI, Rulaningtyas R. Classification of neovascularization using convolutional neural network model. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019; 17(1): 463-472.
[18] Meng Q, et al. Relational autoencoder for feature extraction. 2017 International Joint Conference on Neural Networks (IJCNN). 2017: 364-371.
[19] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv. 2014.
[20] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks. 1994; 5(6): 989-993.
[21] Danielsson PE. Euclidean distance mapping. Computer Graphics and Image Processing. 1980; 14(3): 227-248.
[22] Craw S. Manhattan distance. In: Sammut C, Webb GI. Encyclopedia of Machine Learning and Data Mining. Springer. 2017: 790-791.
[23] Kokare M, Chatterji B, Biswas P. Comparison of similarity metrics for texture image retrieval. TENCON 2003. IEEE Conference on Convergent Technologies for the Asia-Pacific Region. 2003; 2: 571-575.
[24] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002; 24(7): 971-987.
[25] Tan X, Triggs B. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing. 2010; 19(6): 1635-1650.
[26] Guo Z, Zhang L, Zhang D. A completed modeling of local binary pattern operator for texture classification. IEEE Transactions on Image Processing. 2010; 19(6): 1657-1663.
[27] Zhang B, et al. Local derivative pattern versus local binary pattern: face recognition with high-order local pattern descriptor. IEEE Transactions on Image Processing. 2010; 19(2): 533-544.
[28] Prasetyo H, Wiranto W, Winarno W. Statistical Modeling of Gabor Filtered Magnitude for Batik Image Retrieval. Journal of Telecommunication, Electronic and Computer Engineering. 2018; 10(2-4): 85-89.