International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 3, June 2020, pp. 2367~2374
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i3.pp2367-2374
Journal homepage: http://ijece.iaescore.com/index.php/IJECE
Combined cosine-linear regression model similarity with
application to handwritten word spotting
Youssef Elfakir, Ghizlane Khaissidi, Mostafa Mrabti, Driss Chenouni, Manal Boualam
Laboratory of Computing and Interdisciplinary Physics, ENS, Sidi Mohamed Ben Abdellah University, Morocco
Article Info
Article history:
Received Apr 5, 2019
Revised Nov 9, 2019
Accepted Nov 25, 2019
Abstract: Similarity and distance measures are widely used to quantify the similarity or dissimilarity between vector sequences; in document image analysis, such measures play an important role in matching and pattern recognition. There are several types of similarity measure. In this paper we survey various distance measures used in image matching and explain the limitations associated with the existing distances. We then introduce the concept of a floating distance, which varies the selected threshold for each word in the decision-making process, based on a combination of linear regression and the cosine distance. Experiments are carried out on handwritten Arabic document images from the Gallica library. These experiments show that the proposed floating distance outperforms traditional distances in a word spotting system.
Keywords:
Bag-of-visual word
Features extractions
Floating threshold
Handwritten Arabic documents
Similarity distance
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Youssef Elfakir,
Laboratory of Computing and Interdisciplinary Physics,
National Superior School, Sidi Mohammed Ben Abdellah University,
Fes 30050, Morocco.
Email: youssef.elfakir1@usmba.ac.ma
1. INTRODUCTION
In pattern recognition, objects can be represented as sequences of features, and a similarity measure serves as a tool to judge the similarity or dissimilarity between two such sequences. In the literature, two objects are called similar when they look alike or lie near each other without being identical. Several similarity and distance measures have been widely used across fields: text similarity [1], document similarity [2], handwriting recognition [3-5], handwritten character recognition [6, 7], speech recognition [8-10], and video analysis [11]. Choosing a good similarity measure is therefore a fundamental step for many applications and domains, such as information retrieval and recognition, chemistry, clustering, and classification.
In pattern recognition, the term similarity denotes a score that represents the strength of the relationship between two sequences of data items. Pattern recognition, and word spotting in particular, relies on a similarity measure to search for images that resemble or lie near each other. Mathematically, a similarity distance quantifies how far apart two sequences of data are; in other domains it is called dissimilarity, though the underlying concept is the same, only inverted [12]. Figure 1 shows a chronological overview of the similarity and distance measures commonly used in various fields on feature sequences [13, 14]; they divide into three groups: one group is distance based, and the second and third groups of similarity measures are respectively correlation and non-correlation based.
Similarity measures are an important topic, extensively applied in fields such as decision making, pattern recognition, machine learning, and market prediction, through distance functions such as the Euclidean, Manhattan, Minkowski, and cosine distances. However, these distance measures carry many limitations, which we discuss in this paper and aim to overcome in our research.
Figure 1. Chronological overview of the similarity and distance measures
The organization of this paper is as follows. First, we give a chronological overview of the similarity and distance measures used in similarity searching and retrieval systems. Then, we investigate in detail the limitations of the existing distances in word spotting systems. Finally, we propose a floating distance based on the combination of linear regression and the cosine distance and apply it to handwritten Arabic document images from Gallica.
2. OVERVIEW OF SIMILARITY AND DISTANCE METRICS
For years, different distance measures have been used to calculate the similarity between two data objects; the aim of these metrics is to find a specific distance function that allows elements in a data set to be separated or classified [15, 16]. These distances operate directly in fields such as similarity searching and retrieval systems, where performance depends on the chosen function. Here, we present several similarity distances commonly used in pattern recognition and word spotting systems.
2.1. Euclidean distance
The Euclidean distance (1) is the standard metric for geometrical cases; its mathematical formula is defined as:

d(x, y) = √( Σi (xi − yi)² )   (1)

It is a simple metric between two data points x and y that takes the root of the squared differences between them; it is the default metric in the k-means algorithm and the most used in clustering problems.
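As a minimal illustration (plain Python rather than the paper's MATLAB setup), (1) can be computed as:

```python
import math

def euclidean(x, y):
    """Euclidean distance (1): root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
print(euclidean([0, 0], [3, 4]))  # -> 5.0
```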
2.2. Manhattan distance
The Manhattan or taxicab distance (2) is a metric that sums the absolute differences between two data points; it is also known as the L1 distance or rectilinear distance and is defined as:

d(x, y) = Σi |xi − yi|   (2)
2.3. Minkowski distance
The Minkowski distance (3), known as the generalized metric, can be used for data that are ordinal and quantitative. The distance of order p between two variables is defined as:

d(x, y) = ( Σi |xi − yi|^p )^(1/p)   (3)

This distance coincides with the Euclidean distance when p = 2 and with the Manhattan distance when p = 1.
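A short sketch (plain Python, not from the paper) makes the reduction to the Manhattan (p = 1) and Euclidean (p = 2) cases explicit:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (3): p-th root of the summed |xi - yi|**p."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 1))  # Manhattan distance (2): 7.0
print(minkowski(x, y, 2))  # Euclidean distance (1): 5.0
```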
2.4. Cosine distance
The cosine distance (4) is derived from the cosine of the angle between two vectors in an n-dimensional space, given by the following formula:

cos(x, y) = (x · y) / (‖x‖ ‖y‖) = Σi xi yi / ( √(Σi xi²) √(Σi yi²) )   (4)

This measure is the dot product of the two vectors divided by the product of their lengths; the corresponding distance is one minus this value. It is often used in information retrieval and text mining.
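A sketch in plain Python, using the distance form (one minus the cosine of the angle):

```python
import math

def cosine_distance(x, y):
    """Cosine distance: 1 minus the dot product over the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)

# Parallel vectors have distance ~0; orthogonal vectors have distance 1.
print(cosine_distance([1, 2, 3], [2, 4, 6]))  # ~0.0
print(cosine_distance([1, 0], [0, 1]))        # -> 1.0
```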
2.5. Jaccard distance
The Jaccard distance (coefficient), a term coined by Paul Jaccard, measures the similarity of two data items. The Jaccard coefficient (5) is measured by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B|,   with Jaccard distance = 1 − J(A, B)   (5)

This function is best used when calculating the similarity between small numbers of sets and is often used in the recommendation field.
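A minimal set-based sketch of (5) in plain Python:

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 minus |A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

# Two items shared out of four distinct items overall: coefficient 0.5.
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # -> 0.5
```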
2.6. Chebyshev distance
The Chebyshev distance (6), also called the Tchebychev or maximum value distance, takes the largest absolute difference between the coordinates of two data points:

d(x, y) = maxi |xi − yi|   (6)
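And a one-line sketch of (6):

```python
def chebyshev(x, y):
    """Chebyshev distance (6): the largest absolute coordinate difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Coordinate differences are 3, 0 and 2, so the maximum is 3.
print(chebyshev([1, 5, 2], [4, 5, 0]))  # -> 3
```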
3. COMPARISON OF DISTANCE METRICS IN WORD SPOTTING SYSTEMS
The aim of query-by-example word spotting is to search for and localize the regions that are similar to a given query. Obtaining the best performance remains a challenge, because several tasks influence such a system: image processing, feature extraction, similarity measures, and classification. Our work focuses on the influence of similarity measure distances in the context of handwritten document image retrieval, where many existing metrics can be adapted. We underline that the presented similarity measure is applicable to any problem where the queries differ in size from one another.
In this part, we present a comparison of distance measures in a word spotting system. For this, we use the Arabic handwritten document images from Gallica, the open-access digital library of the National Library of France. First, we divide the document images into a set of densely sampled local regions; these local regions are the basic structure used to spot the words in the document. We extract all interest points from the images using the SIFT detector [17] and use the SIFT descriptor to represent each interest point. Then, in the learning step, we select the first 4 document images and group all their descriptors to build a bag of visual descriptors and cluster centers; at this stage, the k-means algorithm is used with different numbers of centers (100, 200, 300, 400). Finally, we encode each region by a histogram using a bag of visual descriptors [18, 19] instead of representing it by its raw descriptors. All regions are described by their histograms, obtained by assigning each descriptor in the region to the nearest visual descriptor in the codebook. So each region j is represented by a histogram of accumulated frequencies (H1,j, ..., HN,j), where N is the number of visual descriptors in the codebook and Hk,j is the cumulative frequency of the k-th visual descriptor in the j-th region.
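The encoding step above can be sketched as follows. This is an illustrative toy (2-D "descriptors" and a hand-picked 3-center codebook), not the paper's SIFT/k-means pipeline:

```python
def nearest_center(descriptor, centers):
    """Index of the codebook center closest (in squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, centers[k])))

def region_histogram(descriptors, centers):
    """Bag-of-visual-words histogram: count how many of the region's
    descriptors are assigned to each codebook center."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist

# Toy 2-D "descriptors" and a 3-center codebook (illustrative values only).
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
region = [[0.1, 0.2], [9.5, 0.3], [0.2, 9.8], [0.0, 0.1]]
print(region_histogram(region, centers))  # -> [2, 1, 1]
```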
The process of the used system is presented in Figure 2. To evaluate the effect of the similarity measure and determine the appropriate distance in the context of word spotting, we vary this metric and calculate the detection accuracy, represented by the F-score, for different queries. The F-score is calculated as follows:

F-score = 2 × (Precision × Recall) / (Precision + Recall)   (7)
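The F-score in (7) is the harmonic mean of precision and recall; a minimal sketch:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, as in (7)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision and recall give an F-score of 1.0.
print(f_score(1.0, 1.0))  # -> 1.0
```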
Table 1 shows the F-score results for the Gallica dataset. We observe that the F-score depends on the codebook size and on the distance metric used. For the cosine metric and a codebook size of 300 we obtain an F-score of 0.78, while for the other metrics we report an F-score of 0.62 for the Euclidean distance and 0.52 for the Chebyshev distance. So the best result is obtained with the cosine metric and 300 clusters.
Figure 2. Process of the proposed system
Table 1. F-score for different codebook sizes and different metric distances
Codebook size 100 200 300 400
Euclidean distance 0.27 0.58 0.62 0.53
Manhattan distance 0.4 0.46 0.5 0.48
Minkowski distance 0.15 0.23 0.34 0.30
Cosine distance 0.627 0.76 0.78 0.629
Jaccard distance 0.27 0.36 0.5 0.41
Chebyshev distance 0.37 0.42 0.52 0.45
4. LINEAR REGRESSION MODEL SIMILARITY WITH APPLICATION TO HANDWRITTEN
WORD SPOTTING
4.1. Analysis of cosine distance measure for word spotting
As shown in the previous part, the metric that gives the best result is the cosine distance. However, Table 1 shows that the size of the codebook directly influences the performance. We therefore believe it is necessary to analyse the phenomena that appear when using codebooks in high-dimensional spaces; these phenomena can be explained by the curse of dimensionality, a term coined by Richard E. Bellman [20, 21]. For this, the system presented in Figure 2 is used to evaluate in detail codebook sizes from 100 to 900 centers.
Then, to calculate the similarity between the histogram of each region in the document image and the query's histogram, we use the cosine distance (one minus the cosine similarity), defined by:

S = 1 − Σi Hi,j Ri / ( √(Σi Hi,j²) √(Σi Ri²) )   (8)

where Hi,j represents the occurrence of the i-th center of the codebook in the j-th region and Ri represents the occurrence of the i-th center in the query. To select the regions that are similar to the query, the cosine distance S should be inferior to a certain threshold. For two perfectly similar regions, Σi Hi,j Ri / ( √(Σi Hi,j²) √(Σi Ri²) ) = 1, so S = 0. Therefore, the smaller the threshold, the more similar the selected regions.
For this, for each codebook size, we calculate the F-score obtained by choosing the best threshold for that size, as shown in Figure 3. We observe that the size influences the performance; the best result is given for k = 300 and performance decreases beyond this size.
Figure 3. The F-Score curve
Now, we examine the influence of the codebook size on the threshold. From (8), when the size of the codebook N increases, the probability of finding an empty cell in the histogram of a given region increases, as shown in Figure 3; that means:

P(Hi,j = 0): when N increases, the probability P(Hi,j = 0) increases; when N decreases, it decreases.

Subsequently, for a fixed number of interest points Σi Hi,j, the overlap between the region and query histograms, Σi Hi,j Ri, tends to decrease as N increases, so the cosine distance S = 1 − Σi Hi,j Ri / ( √(Σi Hi,j²) √(Σi Ri²) ) increases, and it decreases when N decreases.

Here Ri represents the histogram of the query, Hi,j the histogram of the j-th region in the document, and Σi Hi,j the number of interest points in the j-th region. So when the size of the histogram N increases, the number of empty cells increases too. Consequently, the threshold should take into account the size of the codebook (number of clusters). Figure 4 shows an example of query and region histograms.
For the experimental step, Figure 5 shows the curve of the best threshold for each codebook size. The threshold depends on the size of the codebook and can be represented by y = 0.000265x + 0.3993.
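A line of this form can be obtained with an ordinary least-squares fit. The sketch below uses hypothetical threshold measurements (illustrative values, not the paper's data) to show the procedure in plain Python rather than MATLAB:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b through the points (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

# Hypothetical best-threshold measurements per codebook size (illustrative only).
sizes = [100, 300, 500, 700, 900]
thresholds = [0.43, 0.48, 0.53, 0.58, 0.63]
slope, offset = linear_fit(sizes, thresholds)
print(slope, offset)
```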
Figure 4. Example of query and region histograms
Figure 5. Best threshold for each codebook size
At this stage, and to complete the experiments, we examine the influence of the number of interest points (number of descriptors, n) on the threshold. From (8), when the number of interest points n increases, the probability of finding an empty cell in the histogram of a given region decreases, as shown in Figure 3; that means:

P(Hi,j = 0): when n increases, the probability P(Hi,j = 0) decreases; when n decreases, it increases.
Subsequently, the overlap Σi Hi,j Ri tends to increase as n increases, so the cosine distance S = 1 − Σi Hi,j Ri / ( √(Σi Hi,j²) √(Σi Ri²) ) decreases, and it increases when n decreases.

In fact, when the number of interest points n increases, the number of empty cells decreases too. Consequently, the threshold should take into account the number of interest points (number of descriptors), because this number changes from one region to another. Experimentally, Figure 6 shows the threshold curve according to the number of descriptors, where we can see that the threshold depends on the descriptors; this dependence can be represented by y = −0.0009x + 0.5219.
Figure 6. Threshold according to the number of descriptors
Given this condition, two regions are declared similar if the distance S is less than a given threshold value. However, two regions having the same number of interest points will share the same threshold, and this does not mean that they are similar, because they may have different histograms and thus different similarity distances. To conclude this part, each word/region in a document has a certain number of interest points/descriptors, so a compromise between the number of descriptors and the threshold must be sought.
4.2. Combination of cosine-linear regression similarity
Regression focuses on the relationship between outcomes and inputs; it also provides a model with some explanatory value. In our case, the inputs are the codebook size and the number of interest points, and the outcome is the threshold value that defines whether two regions are similar or not. Linear regression is a commonly used modeling technique in which the outcome variable is expressed as a linear combination of the input variables; for a given set of input values, the linear regression model provides the expected outcome value.
y = β0 + β1 x1 + β2 x2 + ... + βp−1 xp−1 + ε   (9)

where y is the threshold variable, the xj are the input variables for j ∈ {1, 2, ..., p−1}, β0 is the value of y when each xj equals zero (offset), and βj is the change in y for a unit change in xj (coefficient). The xj are independent of each other, and the random error ε is assumed to be normally distributed with a mean of zero and a constant variance σ². So the equation used is T = −0.0009 × n + 0.5219 for N = 300.
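Putting the pieces together, the floating-threshold decision for a single region might be sketched as below. The regression coefficients are the values reported above for N = 300, while the helper names and histograms are illustrative:

```python
import math

def cosine_dist(h, r):
    """Cosine distance between a region histogram h and a query histogram r."""
    dot = sum(hi * ri for hi, ri in zip(h, r))
    norm = math.sqrt(sum(hi * hi for hi in h)) * math.sqrt(sum(ri * ri for ri in r))
    return 1.0 - dot / norm

def floating_threshold(n_descriptors):
    """Per-region threshold from the reported fit T = -0.0009*n + 0.5219 (N = 300)."""
    return -0.0009 * n_descriptors + 0.5219

def is_match(region_hist, query_hist):
    """A region matches the query when its cosine distance falls below
    the threshold adapted to its own number of descriptors."""
    n = sum(region_hist)  # number of interest points in the region
    return cosine_dist(region_hist, query_hist) < floating_threshold(n)

# A region nearly identical to the query (distance ~0) is accepted.
query = [5, 0, 3, 2]
print(is_match([5, 0, 3, 2], query))  # -> True
```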
However, the performance of regression analysis methods in practice depends on the type and form of the data used and on how it relates to the chosen regression method. As shown in Figures 5 and 6, the threshold depends on the codebook size and on the number of descriptors of each word; the proposed method is a generic similarity measure that takes these parameters into account and imposes no condition specific to handwritten documents. The proposed similarity measure is evaluated in the context of a word spotting task. The aim is to query Arabic handwritten documents under the query-by-example paradigm and to select/report the regions that are similar to the query.
We report in Figure 7 some queries for which the system automatically returns regions that are similar to the query, without having to choose the threshold, which is a problem in other systems [22, 23]. Then, we use a filtering step to select the single best result when confusable regions are returned, based on their similarity scores and positions. Table 2 shows a comparison of the proposed method with the state of the art in terms of precision, tested on the Gallica dataset. An improvement is observed with the proposed similarity distance over the other methods.
Table 2. Performance of the proposed method and other works
Method Precision
Almazán et al. [23] 68.4%
Rusinol et al. [24] 81%
Proposed method 83%
Howe et al. [25] 79%
Liang et al. [26] 67%
Fischer et al.[27] 62%
Terasawa et al. [28] 79%
Figure 7. The retrieved regions for some queries
5. CONCLUSION
In this paper, we present a generic similarity measure for word spotting systems that takes into account the codebook size and the number of descriptors of the query, and that imposes no condition specific to handwritten documents. A comparison of the proposed method with other methods shows that the generic similarity measure gives an excellent result in terms of precision. We tested our method with an experimental setup based on MATLAB code applied to the Gallica database.
REFERENCES
[1] Mihalcea R., Corley C. and Strapparava C., "Corpus-based and knowledge-based measures of text semantic
similarity," In AAAI, vol. 6, pp. 775-780, 2006.
[2] Huang, Anna, "Similarity measures for text document clustering," In: Proceedings of the sixth new zealand
computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, pp. 9-56, 2008.
[3] Marti, U. V., Bunke, H., "Using a statistical language model to improve the performance of an HMM-based cursive
handwriting recognition system," In Hidden Markov models: applications in computer vision, pp. 65-90, 2001.
[4] H. Bunke, et. al., "Offline recognition of unconstrained handwritten texts using HMMs and statistical language
models," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 709-720, 2004.
[5] C. Bahlmann and H. Burkhardt, "Measuring HMM similarity with the Bayes probability of error and its application
to online handwriting recognition," Proceedings of Sixth International Conference on Document Analysis and
Recognition, Seattle, WA, USA, pp. 406-411, 2001.
[6] J.A. R-Serrano, F. Perronnin, "A Model-based sequence similarity with application to handwritten word spotting,"
in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2108-2120, Nov. 2012.
[7] Q. Wang, F. Yin and C. Liu, "Handwritten Chinese Text Recognition by Integrating Multiple Contexts," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469-1481, Aug. 2012.
[8] Singh, R., Raj, B., Stern, R. M., "Structured redefinition of sound units by merging and splitting for improved
speech recognition," In Sixth International Conference on Spoken Language Processing, 2000.
[9] Huo, Q., Li, W., "A DTW-based dissimilarity measure for left-to-right hidden Markov models and its application to
word confusability analysis," In Ninth International Conference on Spoken Language Processing, 2006.
[10] J. Hershey, P. Olsen, and S. Rennie, "Variational Kullback-Leibler Divergence for Hidden Markov Models,"
Proc. Workshop Automatic Speech Recognition Understanding, 2007.
[11] M. Brand and V. Kettnaker, "Discovery and segmentation of activities in video," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 22, no. 8, pp. 844-851, Aug. 2000.
[12] Willett, P. et al., "Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using
2D fragment bit-strings," Combinatorial chemistry & high throughput screening, vol. 5, no. 2, pp. 155-166, 2002.
[13] Seung-Seok C., Sung-Hyuk C. and Tappert C., "A Survey of Binary Similarity and Distance Measures," Journal of
Systemics, Cybernetics & Informatics, vol. 8, chapter 1, pp. 43–48, 2010.
[14] Cha S., "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,"
Int. J. Math. Model. Methods Appl. Sci, vol. 1, chapter 4, pp. 300–307, 2007.
[15] Chang, W. L., Kanesan, J., Kulkarni, A. J., Ramiah, H., "Data clustering using seed disperser ant algorithm,"
Turkish Journal of Electrical Engineering & Computer Sciences, vol. 25, no. 6, pp. 4522-4532, 2017.
[16] Ortega, J.P., Del, M., Rojas, R.B., Somodevilla, M.J., "Research issues on k-means algorithm: An experimental
trial using matlab," In CEUR Workshop Proceedings: Semantic Web and New Technologies, pp. 83-96, Mar. 2009.
[17] Lowe, D. G., "Distinctive image features from scale-invariant keypoints," International journal of computer vision,
vol. 60, no. 2, pp. 91-110, 2004.
[18] Marti, U-V., Bunke, H., "Using a statistical language model to improve the performance of an HMM-based cursive
handwriting recognition system," Int J Pattern Recognit Artif Intell, vol. 15, no. 01, pp. 65-90, 2001.
[19] Nowak, E., Jurie, F., Triggs, B., "Sampling strategies for bag-of-features image classification," in European
Conference on Computer Vision, Lecture Notes in Computer Science(LNCS) 3954, pp. 490-503, 2006.
[20] Bellman, R. E., "Dynamic Programming," Princeton University Press, Rand Corporation, 1957.
[21] Bellman, R. E., "Adaptive Control Processes: A Guided Tour," Princeton University Press, 1961.
[22] Sankar, K.P., Manmatha, R., Jawahar, C.V., "Large-scale document image retrieval by automatic word annotation,"
International Journal on Document Analysis and Recognition (IJDAR), vol. 17, no. 1, pp. 1-17, 2013.
[23] Almazán, J., Gordo, A., Fornés, A., et al., "Segmentation-free word spotting with exemplar SVMs," Pattern
Recognition, vol. 47, no. 12, pp. 3967-3978, 2014.
[24] Rusiñol, M., Aldavert, D., Llados, R.J., "Efficient segmentation-free key word spotting in historical document
collections," Pattern Recognition, vol. 48, pp. 545-555, 2015.
[25] Howe, N., Rath, T., Manmatha, R., "Boosted decision trees for word recognition in handwritten document
retrieval," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 377-383, 2005.
[26] Liang, Y., Fairhurst, M., Guest, R., "A synthesized word approach to word retrieval in handwritten documents,"
Pattern Recognit, vol. 45, no. 12, pp. 4225-4236, 2012.
[27] Fischer, A., Keller, A., Frinken, V., et al., "Lexicon-free handwritten word spotting using character HMMs,"
Pattern Recognit.Lett, vol. 33, no. 7, pp. 934-942, 2012.
[28] Terasawa, K., Tanaka, y., "Slit style HOG feature for document image word spotting," in Proceeding of
the International Conference on Document Analysis and Recognition, pp. 116-120, 2009.
Influence of Green Infrastructure on Residents’ Endorsement of the New Ecolog...
UNIT no 1 INTRODUCTION TO DBMS NOTES.pdf
distributed database system" (DDBS) is often used to refer to both the distri...
Fundamentals of safety and accident prevention -final (1).pptx
introduction to high performance computing
Occupational Health and Safety Management System
Design Guidelines and solutions for Plastics parts

Combined cosine-linear regression model similarity with application to handwritten word spotting

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 3, June 2020, pp. 2367-2374
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i3.pp2367-2374
Journal homepage: http://ijece.iaescore.com/index.php/IJECE

Youssef Elfakir, Ghizlane Khaissidi, Mostafa Mrabti, Driss Chenouni, Manal Boualam
Laboratory of Computing and Interdisciplinary Physics, ENS, Sidi Mohamed Ben Abdellah University, Morocco

Article history: Received Apr 5, 2019; Revised Nov 9, 2019; Accepted Nov 25, 2019

ABSTRACT: Similarity and distance measures are widely used to quantify how alike or different two sequences of feature vectors are. Document-image similarity is the domain dealing with image information, where similarity/distance measures play a central role in matching and pattern recognition. Many such measures exist; in this paper we survey the distance measures commonly used in image matching and explain the limitations of the existing distances. We then introduce the concept of a floating distance, in which the decision threshold varies per query word, based on a combination of linear regression and the cosine distance. Experiments carried out on handwritten Arabic document images from the Gallica library show that the proposed floating distance outperforms the traditional fixed distance in a word spotting system.

Keywords: Bag-of-visual-words; Feature extraction; Floating threshold; Handwritten Arabic documents; Similarity distance

Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Youssef Elfakir, Laboratory of Computing and Interdisciplinary Physics, National Superior School, Sidi Mohammed Ben Abdellah University, Fes 30050, Morocco.
Email: youssef.elfakir1@usmba.ac.ma

1. INTRODUCTION
In pattern recognition, objects can be represented as sequences of features, and a similarity measure serves as a tool to judge how alike or different two such sequences are. In the literature, "similar" is defined as two objects looking or being near to one another, but not identical. Similarity and distance measures have been used widely across fields: text similarity [1], document similarity [2], handwriting recognition [3-5], handwritten character recognition [6, 7], speech recognition [8-10], and video analysis [11]. Choosing a good similarity measure is therefore a fundamental step in many applications and domains, such as information retrieval and recognition, chemistry, clustering, and classification. In pattern recognition, the term similarity denotes a score that represents the strength of the relationship between two sequences of data items; pattern recognition, and word spotting in particular, is done with a similarity measure that searches for images that look alike or lie near one another. Mathematically, a similarity distance calculates how far apart two sequences of data are; it is also named dissimilarity in other domains, though the concept is the same, only inverted [12]. Figure 1 shows a chronological overview of the similarity and distance measures commonly used in various fields, operating on feature sequences [13, 14]; these divide into three groups: one distance-based, and two groups of similarity measures that are respectively correlation-based and non-correlation-based. Similarity measures are an important topic, extensively applied in fields such as decision making, pattern recognition, machine learning, and market prediction, building on distance functions such as the Euclidean, Manhattan, Minkowski, and cosine distances.
However, these distance measures come with many limitations, which we identify in this paper and which our research aims to overcome.
Figure 1. Chronological overview of the similarity and distance measures

The organization of this paper is as follows. First, we give a chronological overview of the similarity and distance measures used in similarity searching and retrieval systems. Then, we investigate in detail the limitations of the existing distances in word spotting systems. Finally, we propose a floating distance based on the combination of linear regression and the cosine distance and apply it to handwritten Arabic document images from Gallica.

2. OVERVIEW OF SIMILARITY AND DISTANCE METRICS
For years, different distance measures have been used to calculate the similarity between two data objects; the aim of these metrics is to find a distance function that allows a separation or classification between the elements of a data set [15, 16]. These distances operate directly in fields such as similarity searching and retrieval systems, where performance depends on the chosen function. Here, we present several similarity distances commonly used in pattern recognition and word spotting systems.

2.1. Euclidean distance
The Euclidean distance (1) is the general and standard metric for geometrical cases:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

It is a simple metric between two data points (x, y) that takes the root of the squared differences between them; it is the default metric in the k-means algorithm and the most used in clustering problems.

2.2. Manhattan distance
The Manhattan or taxicab distance (2) calculates the absolute differences between two data points. Also known as the L1 or rectilinear distance, it is defined as:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (2)

2.3. Minkowski distance
The Minkowski distance (3), known as the generalized metric, can be used for data that are ordinal and quantitative.
The distance of order p between two variables is defined as:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (3)

This distance reduces to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1.
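As a concrete illustration, the three metrics above can be sketched in a few lines of Python; the function names are ours, not from the paper:

```python
import math

def euclidean(x, y):
    # Root of summed squared differences, Eq. (1).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences (L1), Eq. (2).
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Generalized metric of order p, Eq. (3).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

As the text notes, `minkowski(x, y, 2)` coincides with `euclidean(x, y)` and `minkowski(x, y, 1)` with `manhattan(x, y)`.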
2.4. Cosine distance
The cosine similarity (4) is the cosine of the angle between two vectors in an n-dimensional space, given by the following formula:

\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}    (4)

It is the dot product of the two vectors divided by the product of their lengths, and is often used in information retrieval and text mining.

2.5. Jaccard distance
The Jaccard distance (coefficient), a term coined by Paul Jaccard, measures the similarity of two data items. The Jaccard distance (5) between two sets A and B is measured by the following formula:

d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}    (5)

This function is best used when calculating the similarity between small numbers of sets and is often used in the recommendation field.

2.6. Chebyshev distance
The Chebyshev distance (6), also called the Tchebychev or maximum value distance, calculates the maximum absolute difference between the coordinates of two data points:

d(x, y) = \max_i |x_i - y_i|    (6)

3. COMPARISON OF DISTANCE METRICS IN WORD SPOTTING SYSTEMS
The aim of query-by-example word spotting is to search for and localize the regions that are similar to a given query. Obtaining the best performance remains a challenge, because several tasks influence the system: image processing, feature extraction, similarity measures, and classification. Our work focuses on the influence of similarity measure distances in the context of handwritten document image retrieval, where many existing metrics can be adapted. We underline that the presented similarity measure is applicable to any problem where the size of one query differs from another. In this part, we present a comparison of distance measures in a word spotting system. For this, we use the Arabic handwritten document images from Gallica, the open-access digital library of the National Library of France.
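The remaining three measures can be sketched the same way; again the function names are ours, and the cosine measure is written as a dissimilarity (1 minus the cosine) so that 0 means identical direction:

```python
import math

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between x and y, cf. Eq. (4).
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def jaccard_distance(a, b):
    # 1 minus |A intersect B| / |A union B| on two sets, cf. Eq. (5).
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def chebyshev(x, y):
    # Maximum coordinate-wise absolute difference, Eq. (6).
    return max(abs(p - q) for p, q in zip(x, y))
```

Note that `cosine_distance` is scale-invariant: two collinear vectors of different lengths get distance 0, which is why it suits frequency histograms.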
First, we divide the document images into a set of densely sampled local regions; these local regions are the basic structure used to spot words in the document. We extract all interest points from the images with the SIFT detector [17] and represent each interest point by its SIFT descriptor. Then, in the learning step, we select the first 4 document images and pool all of their descriptors to build a bag of visual descriptors and cluster centers; at this stage, the k-means algorithm is run with different numbers of centers (100, 200, 300, 400). Finally, we encode each region as a histogram over the bag of visual descriptors [18, 19] instead of representing it by its raw descriptors. All regions are described by their histograms, obtained by assigning each descriptor in the region to the nearest visual descriptor in the codebook; each region is thus represented by a histogram of accumulated frequencies, where N is the number of visual descriptors in the codebook and the k-th bin holds the cumulative frequency of the k-th visual descriptor in the j-th region. The process of the system is presented in Figure 2. To evaluate the effect of the similarity measure and determine the appropriate distance for word spotting, we vary this metric and calculate the detection accuracy, represented by the F-score over different queries. The F-score is calculated as follows:

F = \frac{2 \times Precision \times Recall}{Precision + Recall}    (7)

Table 1 shows the F-score results for the Gallica dataset. We observe that the F-score depends on both the codebook size and the distance metric used. With the cosine metric and a codebook of size 300 we obtain an F-score of 0.78, whereas we record an F-score of 0.62 for the Euclidean distance and 0.52 for the Chebyshev distance. The best result is thus obtained with the cosine metric and 300 clusters.
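The encoding and evaluation steps just described can be sketched as follows. The names `bovw_histogram` and `f_score` are ours, and the 2-D points below are toy stand-ins: a real system would feed 128-D SIFT descriptors and a k-means codebook into the same assignment loop.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    # Assign each local descriptor to its nearest visual word (codebook
    # center) and accumulate the counts into an N-bin histogram.
    hist = np.zeros(codebook.shape[0])
    for d in descriptors:
        nearest = np.argmin(np.linalg.norm(codebook - d, axis=1))
        hist[nearest] += 1
    return hist

def f_score(precision, recall):
    # Harmonic mean of precision and recall, Eq. (7).
    return 2 * precision * recall / (precision + recall)
```

For example, with a 2-word codebook at (0, 0) and (10, 10), three descriptors near those centers produce a histogram whose bins count the assignments per visual word.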
Figure 2. Process of the proposed system

Table 1. F-score for different codebook sizes and different metric distances

Codebook size        100     200     300     400
Euclidean distance   0.27    0.58    0.62    0.53
Manhattan distance   0.4     0.46    0.5     0.48
Minkowski distance   0.15    0.23    0.34    0.30
Cosine distance      0.627   0.76    0.78    0.629
Jaccard distance     0.27    0.36    0.5     0.41
Chebyshev distance   0.37    0.42    0.52    0.45

4. LINEAR REGRESSION MODEL SIMILARITY WITH APPLICATION TO HANDWRITTEN WORD SPOTTING
4.1. Analysis of the cosine distance measure for word spotting
As shown in the previous part, the metric that gives the best result is the cosine distance. However, Table 1 shows that the codebook size directly influences performance. We therefore believe it necessary to analyze the phenomena that appear when using a codebook in high-dimensional spaces; these phenomena can be explained by the curse of dimensionality, a term coined by Richard E. Bellman [20, 21]. For this, the system presented in Figure 2 is used to evaluate in detail codebook sizes from 100 to 900 centers. Then, to calculate the similarity between the histograms of each region in the document image and the query's histogram, we use the cosine dissimilarity defined by:

S = 1 - \frac{\sum_{i=1}^{N} R_i H_{i,j}}{\sqrt{\sum_{i=1}^{N} R_i^2} \, \sqrt{\sum_{i=1}^{N} H_{i,j}^2}}    (8)

where H_{i,j} represents the occurrence count of the i-th codebook center in the j-th region, and R_i the occurrence count of the i-th codebook center in the query. To select the regions that are similar to the query, the dissimilarity S must fall below a certain threshold; for two perfectly similar regions the fraction in (8) equals 1, so S = 0. Therefore, the smaller the threshold, the more similar the retrieved regions. For each codebook size, we calculate the F-score at the best threshold for that size, as shown in Figure 3.
We observe that the size influences the performance: the best result is obtained for k = 300, and the F-score decreases beyond this size.

Figure 3. The F-score curve
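A minimal sketch of the retrieval decision implied by Eq. (8), assuming S = 1 − cos so that S = 0 for identical histograms (the function names are ours):

```python
import math

def cosine_dissimilarity(query_hist, region_hist):
    # Eq. (8) as a dissimilarity: 0 for perfectly similar histograms,
    # growing toward 1 as they diverge.
    dot = sum(r * h for r, h in zip(query_hist, region_hist))
    nq = math.sqrt(sum(r * r for r in query_hist))
    nr = math.sqrt(sum(h * h for h in region_hist))
    return 1.0 - dot / (nq * nr)

def is_match(query_hist, region_hist, threshold):
    # A region is retrieved when its dissimilarity to the query
    # falls below the chosen threshold.
    return cosine_dissimilarity(query_hist, region_hist) < threshold
```

With a fixed threshold, every query and every codebook size share the same cut-off, which is exactly the limitation the following analysis exposes.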
Now, we examine the influence of the codebook size on the threshold. From (8), when the codebook size N increases, the probability of finding an empty cell in the histogram of a given region increases, as shown in Figure 3; that is, for each i: when N increases, the probability that H_{i,j} = 0 increases, and when N decreases, that probability decreases. Here R_i is the histogram of the query, H_{i,j} the histogram of the j-th region, and the sum of the H_{i,j} over i is the number of interest points in the j-th region; since that number is fixed for a region, the counts spread over more cells as N grows, and the cosine dissimilarity S between query and region histograms changes with N. So when the histogram size N increases, the number of zero cells increases too; consequently, the threshold should take into account the codebook size (number of clusters). Figure 4 shows an example of query and region histograms. In the experiments, Figure 5 shows the curve of the best threshold for each codebook size: the threshold depends on the codebook size, and this dependence can be represented by the line y = 0.000265x + 0.3993.

Figure 4. Example of query and region histograms

Figure 5. Best threshold for each codebook size

At this stage, and to complete the experiments, we examine the influence of the number of interest points (number of descriptors, n) on the threshold. From (8), when the number of interest points n increases, the probability of finding an empty cell in the histogram of a given region decreases; that is, for each i: when n increases, the probability that H_{i,j} = 0 decreases, and when n decreases, that probability increases.
In fact, when the number of interest points n increases, the number of zero cells decreases. Consequently, the threshold should also take into account the number of interest points (number of descriptors), because this number changes from region to region. Experimentally, Figure 6 shows the threshold curve according to the number of descriptors; the threshold depends on the descriptor count, and this dependence can be represented by the line y = −0.0009x + 0.5219.

Figure 6. Threshold according to the number of descriptors

Given this condition, two regions are declared similar if the dissimilarity S is less than a given threshold value. Two regions with the same number of interest points will receive the same threshold, but this does not mean they are similar, since they may have different histograms and therefore different similarity distances. To conclude this part, each word/region in a document has a certain number of interest points/descriptors, so a compromise between the number of descriptors and the threshold must be sought.

4.2. Combination of cosine-linear regression similarity
Regression focuses on the relationship between outcomes and inputs, and provides a model with some explanatory value. In our case, the inputs are the codebook size and the number of interest points, and the outcome is the threshold value that decides whether two regions are similar. Linear regression is a commonly used modeling technique in which the outcome variable is expressed as a linear combination of the input variables; for a given set of input variables, the linear regression model provides the expected outcome value.
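Fitted lines such as the one reported for Figure 6 can be recovered with an ordinary least-squares fit. The data points below are hypothetical stand-ins (the paper does not tabulate the raw measurements), generated on the reported line y = −0.0009x + 0.5219 purely to demonstrate the fitting step:

```python
import numpy as np

# Hypothetical (descriptor count, best threshold) pairs standing in for
# the measurements behind Figure 6; points lie on the reported line.
n_descriptors = np.array([20, 40, 60, 80, 100], dtype=float)
best_thresholds = -0.0009 * n_descriptors + 0.5219

# Ordinary least-squares fit of a degree-1 polynomial recovers the
# slope and intercept of the threshold model.
slope, intercept = np.polyfit(n_descriptors, best_thresholds, 1)
```

The same call, applied to the (codebook size, best threshold) pairs of Figure 5, would yield the other reported line, y = 0.000265x + 0.3993.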
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon    (9)

where y is the threshold variable and x_1, ..., x_p are the input variables, for j in {1, ..., p}: \beta_0 is the value of y when each x_j equals zero (the offset), \beta_j is the change in y for a unit change in x_j (the coefficient), and the x_j's are independent of each other. The random error, denoted \varepsilon, is assumed to be normally distributed with a mean of zero and a constant variance \sigma^2. The resulting equation, for N = 300, is T = −0.0009 n + 0.5219.
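Putting the two pieces together, the combined cosine-linear-regression decision can be sketched as follows; `floating_threshold` and `spot` are our own illustrative names, using the coefficients reported above for N = 300:

```python
import math

def floating_threshold(n_descriptors, slope=-0.0009, intercept=0.5219):
    # Per-word threshold predicted by the fitted linear model (N = 300):
    # each query gets its own cut-off from its descriptor count n.
    return slope * n_descriptors + intercept

def spot(query_hist, region_hist, n_descriptors):
    # A region matches when its cosine dissimilarity to the query
    # is below the floating threshold for this query's descriptor count.
    dot = sum(a * b for a, b in zip(query_hist, region_hist))
    na = math.sqrt(sum(a * a for a in query_hist))
    nb = math.sqrt(sum(b * b for b in region_hist))
    dissimilarity = 1.0 - dot / (na * nb)
    return dissimilarity < floating_threshold(n_descriptors)
```

For a query with 100 descriptors, for instance, the predicted threshold is −0.0009 × 100 + 0.5219 = 0.4319, so only regions with dissimilarity below that value are reported.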
However, the performance of regression analysis methods in practice depends on the type and form of the data used and on how it relates to the chosen regression method. As shown in Figures 5 and 6, the threshold depends on the codebook size and on the number of descriptors of each word; the proposed method is a generic similarity measure that takes these parameters into account and imposes no condition specific to handwritten documents. The proposed similarity measure is evaluated on the word spotting task, which consists in querying handwritten Arabic documents by example and selecting/reporting the regions similar to the query. Figure 7 shows some queries for which the system automatically returns regions similar to the query, without having to choose the threshold by hand, which is a problem in other systems [22, 23]. A filtering step then selects the single best result when confusable regions are returned, based on their similarity scores and positions. Table 2 compares the proposed method with the state of the art in terms of precision, tested on the Gallica dataset; an improvement is observed with the proposed similarity distance over the other methods.

Table 2. Performance of the proposed method and other works

Method                  Precision
Almazán et al. [23]     68.4%
Rusiñol et al. [24]     81%
Proposed method         83%
Howe et al. [25]        79%
Liang et al. [26]       67%
Fischer et al. [27]     62%
Terasawa et al. [28]    79%

Figure 7. The retrieved regions for some queries

5.
CONCLUSION
In this paper, we presented a generic similarity measure for word spotting systems that takes into account the codebook size and the number of descriptors of the query, and imposes no condition specific to handwritten documents. A comparison of the proposed method with other methods shows that this generic similarity measure gives an excellent result in terms of precision. We tested our method with an experimental setup based on MATLAB code applied to the Gallica database.

REFERENCES
[1] Mihalcea R., Corley C., and Strapparava C., "Corpus-based and knowledge-based measures of text semantic similarity," in AAAI, vol. 6, pp. 775-780, 2006.
[2] Huang, A., "Similarity measures for text document clustering," in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp. 9-56, 2008.
[3] Marti, U.-V., Bunke, H., "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system," in Hidden Markov Models: Applications in Computer Vision, pp. 65-90, 2001.
[4] H. Bunke, et al., "Offline recognition of unconstrained handwritten texts using HMMs and statistical language models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 709-720, 2004.
[5] C. Bahlmann and H. Burkhardt, "Measuring HMM similarity with the Bayes probability of error and its application to online handwriting recognition," Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, pp. 406-411, 2001.
[6] J. A. R-Serrano, F. Perronnin, "A model-based sequence similarity with application to handwritten word spotting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2108-2120, Nov. 2012.
  • 8.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 10, No. 3, June 2020 : 2367 - 2374 2374 [7] Q. Wang, F. Yin and C. Liu, "Handwritten Chinese Text Recognition by Integrating Multiple Contexts," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469-1481, Aug. 2012. [8] Singh, R., Raj, B., Stern, R. M., "Structured redefinition of sound units by merging and splitting for improved speech recognition," In Sixth International Conference on Spoken Language Processing, 2000. [9] Huo, Q., Li, W., "A DTW-based dissimilarity measure for left-to-right hidden Markov models and its application to word confusability analysis," In Ninth International Conference on Spoken Language Processing, 2006. [10] J. Hershey, P. Olsen, and S. Rennie, "Variational Kull back-Leibler Divergence for Hidden Markov Models," Proc. Workshop Automatic Speech Recognition Understanding, 2007. [11] M. Brand and V. Kettnaker, "Discovery and segmentation of activities in video," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 844-851, Aug. 2000. [12] Willett, P. et al., "Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings," Combinatorial chemistry & high throughput screening, vol. 5, no. 2, pp. 155-166, 2002. [13] Seung-Seok C, Sung-Hyuk C and Tappert C., "A Survey of Binary Similarity and Distance Measures, Journal of Systemics, Cybernetics & Informatics, vol. 8, chapter 1, pp. 43–48, 2010. [14] Cha S., "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions," Int. J. Math. Model. Methods Appl. Sci, vol. 1, chapter 4, pp. 300–307, 2007. [15] Chang, W. L., Kanesan, J., Kulkarni, A. J., Ramiah, H., "Data clustering using seed disperser ant algorithm," Turkish Journal of Electrical Engineering & Computer Sciences, vol. 25, no. 6, pp. 4522-4532, 2017. 
[16] Ortega, J. P., Del, M., Rojas, R. B., Somodevilla, M. J., "Research issues on k-means algorithm: an experimental trial using Matlab," in CEUR Workshop Proceedings: Semantic Web and New Technologies, pp. 83-96, Mar. 2009.
[17] Lowe, D. G., "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[18] Marti, U.-V., Bunke, H., "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system," Int. J. Pattern Recognit. Artif. Intell., vol. 15, no. 1, pp. 65-90, 2001.
[19] Nowak, E., Jurie, F., Triggs, B., "Sampling strategies for bag-of-features image classification," in European Conference on Computer Vision, Lecture Notes in Computer Science (LNCS) 3954, pp. 490-503, 2006.
[20] Bellman, R. E., "Dynamic Programming," Princeton University Press, Rand Corporation, 1957.
[21] Bellman, R. E., "Adaptive Control Processes: A Guided Tour," Princeton University Press.
[22] Sankar, K. P., Manmatha, R., Jawahar, C. V., "Large-scale document image retrieval by automatic word annotation," International Journal on Document Analysis and Recognition (IJDAR), vol. 17, no. 1, pp. 1-17, 2013.
[23] Almazán, J., Gordo, A., Fornés, A., et al., "Segmentation-free word spotting with exemplar SVMs," Pattern Recognition, vol. 47, no. 12, pp. 3967-3978, 2014.
[24] Rusiñol, M., Aldavert, D., Lladós, J., "Efficient segmentation-free keyword spotting in historical document collections," Pattern Recognition, vol. 48, pp. 545-555, 2015.
[25] Howe, H., Rath, T., Manmatha, R., "Boosted decision trees for word recognition in handwritten document retrieval," in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377-383, 2005.
[26] Liang, Y., Fairhurst, M., Guest, R., "A synthesized word approach to word retrieval in handwritten documents," Pattern Recognit., vol. 45, no. 12, pp. 4225-4236, 2012.
[27] Fischer, A., Keller, A., Frinken, V., et al., "Lexicon-free handwritten word spotting using character HMMs," Pattern Recognit. Lett., vol. 33, no. 7, pp. 934-942, 2012.
[28] Terasawa, K., Tanaka, Y., "Slit style HOG feature for document image word spotting," in Proceedings of the International Conference on Document Analysis and Recognition, pp. 116-120, 2009.