International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 1, February 2021, pp. 879~891
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i1.pp879-891
Journal homepage: http://ijece.iaescore.com
Similarity-preserving hash for content-based audio retrieval
using unsupervised deep neural networks
Petcharat Panyapanuwat, Suwatchai Kamonsantiroj, Luepol Pipanmaekaporn
Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Thailand
Article Info

Article history:
Received Jan 1, 2020
Revised Jun 8, 2020
Accepted Aug 18, 2020

Keywords:
Content-based audio retrieval
Deep learning
Deep neural networks
Similarity-preserving hash
Unsupervised learning

ABSTRACT

Due to its efficiency in storage and search speed, binary hashing has become an attractive approach for large audio database search. However, most existing hashing-based methods follow a data-independent scheme in which random linear projections or arithmetic expressions are used to construct the hash functions. Hence, the binary codes do not preserve the similarity and may degrade the search performance. In this paper, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. Unlike data-independent hashing methods, we develop a deep network to learn compact binary codes from multiple hierarchical layers of nonlinear and linear transformations such that the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. Experimental results on the Extended Ballroom dataset with 8 genres of 3,000 musical excerpts show that our proposed method significantly outperforms a state-of-the-art data-independent method in both effectiveness and efficiency.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Petcharat Panyapanuwat,
Department of Computer and Information Science,
King Mongkut’s University of Technology North Bangkok,
Bangkok, Thailand.
Email: panyapetch@hotmail.com
1. INTRODUCTION
With the rapidly growing databases of digital audio recordings, novel retrieval strategies have received great attention. Early retrieval approaches use textual metadata describing the content of the music audio (e.g., artist name, song title, album name, genre, or release year). In cases where such descriptions are not available, a content-based retrieval strategy that utilizes the perceptual aspects of the audio is required [1].
Content-based audio retrieval is generally performed in two steps: first, features are extracted from the audio file, and then these features are used to build indexes for searching. The two main issues when performing a search over a large database are search speed and efficient storage. A promising approach for handling these problems is binary hashing, in which high-dimensional features are encoded into compact binary codes.
Several hashing methods have been proposed in the literature. They can be divided into two categories: data-independent methods and data-dependent methods. Methods in the data-independent category [2-7] use random linear projections or arithmetic expressions to construct the hash functions. Without a training process, they are robust to data variation. However, such methods require long hash codes to achieve high precision, which increases the storage cost and degrades the search efficiency [8].
Methods in the data-dependent category, also called learning to hash methods, aim to learn a set of hash functions from available training data that yield compact codes and achieve satisfactory search performance [9]. Existing data-dependent methods can be classified into unsupervised, supervised, and semi-supervised learning approaches. Unsupervised hashing methods [10-12] use unlabeled data to build the hash functions such that the neighbor distance (e.g., the L2 norm) among the training data is preserved. Supervised or semi-supervised hashing methods [13-17] attempt to improve the quality of hashing by leveraging semantic
labels into the learning process. Compared with data-independent methods, it appears that data-dependent
methods can achieve better accuracy with shorter codes [12, 14, 17]. However, data-dependent methods may
be too dependent on the training data [18].
There are both advantages and shortcomings to data-independent and data-dependent methods. However, previous works in both categories do not fully take similarity preservation into consideration, and this may degrade the retrieval performance. In this work, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. We develop a deep network with several hierarchical layers of nonlinear and linear transformations to learn compact binary codes in which the similarity between samples is preserved. Furthermore, the independence and balance properties are included in the objective function to improve the codes. The proposed method is compared with the Shazam algorithm [3], a data-independent hashing method, in terms of accuracy, precision, recall, false positive rate, and storage cost.
2. BACKGROUND
2.1. Learning to hash
Learning to hash attempts to learn a hash function $y = h(x)$ that maps a high-dimensional input item $x \in \mathbb{R}^D$ to a compact code $y$, aiming to improve the search performance [19]. There are four topics to consider in learning to hash: (1) the hash function, (2) similarity preserving, (3) the loss function, and (4) deep learning to hash.
2.1.1. Hash function
There are several ways to design hash functions. The most widely used hash functions are generalized by linear projection, as shown in (1):

$y = h(X) = \mathrm{sgn}(f(W^T X + b))$  (1)

where $y \in \{0,1\}$ or $\{-1,1\}$, $X = \{x_n\}_{n=1}^{N} \in \mathbb{R}^{D \times N}$ is the training set containing $N$ samples, $D$ is the dimension of the input vector, $W = \{w_k\}_{k=1}^{K} \in \mathbb{R}^{D \times K}$ is the projection matrix, $K$ is the number of hash bits, $b$ is the bias variable, $\mathrm{sgn}(z) = -1$ (or $0$) if $z < 0$ and $\mathrm{sgn}(z) = 1$ otherwise, and $f(\cdot)$ is a predefined function, possibly a neural network or another nonlinear function. Different choices of $f(\cdot)$ yield different hash function properties.
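As a concrete illustration, the following is a minimal NumPy sketch of such a linear-projection hash in (1); the random Gaussian projections, the zero bias, and the identity choice of $f(\cdot)$ are assumptions for illustration, in the spirit of data-independent methods, not the learned functions proposed later in this paper.

import numpy as np

def linear_projection_hash(X, W, b):
    # y = sgn(f(W^T X + b)) with f = identity; X: (D, N), W: (D, K), b: (K, 1)
    Z = W.T @ X + b                      # (K, N) projections
    return np.where(Z < 0, -1, 1)        # sgn, mapping each bit to {-1, +1}

rng = np.random.default_rng(0)
D, N, K = 20, 5, 16
X = rng.normal(size=(D, N))              # toy training set of N samples
W = rng.normal(size=(D, K))              # assumed random Gaussian projections
b = np.zeros((K, 1))
codes = linear_projection_hash(X, W, b)  # one 16-bit code per sample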
2.1.2. Similarity-preserving
The distance $d_{ij}$ between two items $x_i$ and $x_j$ can be defined by the standardized Euclidean distance $\|x_i - x_j\|_2$ or by other metrics. The similarity $s_{ij}$ between those items is often defined as a function of the distance $d_{ij}$ (e.g., a Gaussian function, cosine similarity, and so on). In addition, the semantic similarity approach is generally used in similarity search applications. We can apply any distance, such as the Euclidean distance, to the hashing algorithm for semantic similarity by defining the semantic similarity as $s_{ij} = 1$ for adjacent points and $s_{ij} = 0$ or $-1$ for farther points.

In the hash coding space, the Hamming distance $d_{ij}^H$ between the codes $y_i$ and $y_j$ can be defined as $\|y_i - y_j\|_1 = \sum_{k=1}^{K} |h_k(x_i) - h_k(x_j)|$. It is the number of binary digits whose values differ. The Hamming similarity is defined as $s_{ij}^H = K - d_{ij}^H$ for codes valued 1 and 0. For codes valued 1 and -1, the inner product $s_{ij}^H = y_i^T y_j$ is used as the similarity.
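To make these definitions concrete, a small sketch computes the Hamming distance and both similarity variants for illustrative 4-bit codes:

import numpy as np

K = 4
y_i = np.array([1, -1, 1, 1])    # code valued in {1, -1}
y_j = np.array([1, 1, -1, 1])

d_H = int(np.sum(y_i != y_j))    # Hamming distance: number of differing bits -> 2
s_H = K - d_H                    # Hamming similarity K - d_H -> 2
s_inner = int(y_i @ y_j)         # inner-product similarity for {1, -1} codes -> 0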
Let us now focus on similarity preserving. In Figure 1(a), there is a set of three points ($x_1$, $x_2$, and $x_3$) in an input space. By measuring the Euclidean distances between the points, we find that $x_1$ is closer to $x_2$ than to $x_3$, i.e., $x_1$ is more similar to $x_2$ than to $x_3$. $h(x_1)$, $h(x_2)$, and $h(x_3)$ are the representations of $x_1$, $x_2$, and $x_3$ in the hash coding space (or Hamming space), respectively. In Figure 1(b), $h(x_1)$ is closer to $h(x_3)$ while $h(x_2)$ is far away; in this case, the similarities are not preserved. Figure 1(c), on the other hand, shows an example in which the similarities are well preserved.
Figure 1. Similarity-preserving hashing: (a) similar and dissimilar points x1, x2, x3 in the input space; (b) codes h(x1), h(x2), h(x3) in a 2-dim Hamming space where similarities are not preserved; (c) codes where similarities are well preserved
2.1.3. Loss function
The loss function is intended to preserve the similarity order, i.e., to minimize the difference between the nearest neighbor search result in the hash coding space and that in the input space. The loss function $Loss(X, W)$ is defined as follows:

$Loss(X, W) = \arg\min \sum_{x_i, x_j \in X}^{N} \| d_{ij} - d_{ij}^H \|^2$  (2)
where 𝑋 is the input data, and 𝑊 is the projection vector.
Specifically, 𝑦𝑖 = ℎ(𝑥𝑖) needs to be binary. This binary constraint leads to a difficult optimization
problem. To solve the problem, we drop the binary constraint and let the codes be continuous. The codes are
then binarized with thresholding. For binary constraint relaxation, various standard optimization techniques
can be applied.
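A minimal sketch of this relax-then-binarize step, assuming codes in {1, -1} and a zero threshold suitable for tanh-like continuous outputs:

import numpy as np

def binarize(H, threshold=0.0):
    # Threshold relaxed continuous codes H into binary codes in {1, -1}
    return np.where(H >= threshold, 1, -1)

H_relaxed = np.array([[0.93, -0.41],
                      [-0.07, 0.88]])    # continuous outputs after optimization
Y = binarize(H_relaxed)                  # [[ 1, -1], [-1,  1]]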
2.1.4. Deep learning to hash
The goal of learning to hash is to learn specific hash functions that map a high-dimensional input vector to a compact binary vector yielding good retrieval quality and search speed [20]. For unlabeled data, an illustration of an unsupervised deep learning to hash model that maps the input vector $x \in \mathbb{R}^D$ to compact binary codes is shown in Figure 2.
Figure 2. Unsupervised deep learning model: an input layer x ∈ R^D (layer 1), hidden layers 2 to L with weights w^(1), ..., w^(L), and a reconstruction layer x̂ ∈ R^D (layer L+1)
Assume that an unsupervised deep network consists of $L+1$ layers. A binary vector $y_i$ is generated by passing the input vector $x_i$ through the network, which contains multiple hierarchical layers of nonlinear functions. The binary code of $x_i$ at the $L$th layer can be calculated as follows:

$y_i = h(x_i) = \mathrm{sgn}(F(x_i, W))$  (3)

where $F(x_i, W)$ is a composition of nonlinear transformations defined as follows:

$F(x_i, W) = f_L(\cdots f_2(f_1(x_i, w^{(1)}), w^{(2)}) \cdots, w^{(L)})$  (4)

where the vector $x_i$ and the weight vector $w^{(l)}$ are used as input, and the projection $x_{i+1}$ is produced by $f_i(\cdot)$. The learning algorithm aims to learn a set of nonlinear weight vectors $W = \{w^{(1)}, \ldots, w^{(L)}\}$ such that the information from the input space is preserved.
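A sketch of the composition $F(x, W)$ in (4) followed by the sign binarization in (3); the tanh nonlinearity here is an assumption for illustration, since the paper specifies its own activations in section 3.2.

import numpy as np

def F(x, weights, biases):
    # F(x, W) = f_L(... f_2(f_1(x, w^(1)), w^(2)) ..., w^(L)), eq. (4)
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)           # assumed f_l; the USH network uses sigmoid/tanh
    return h

def hash_code(x, weights, biases):
    # y = sgn(F(x, W)), eq. (3); code in {1, -1}^K
    return np.where(F(x, weights, biases) >= 0, 1, -1)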
2.2. Search with hashing
There are two strategies for performing a search with hashing: hash code ranking and hash table lookup [19]. In hash code ranking, an exhaustive search is performed by comparing the distance (e.g., the Hamming distance) between the query and the reference items. The items with the smallest distances, called nearest neighbors, are retrieved. However, the cost of computing the distances degrades performance. The alternative approach, hash table lookup, aims to accelerate the search by reducing the number of distance computations. The inverse lookup database, called a hash table, is composed of buckets indexed by the hash codes. Given the query, the matching items stored in the corresponding bucket are retrieved.
2.3. Audio fingerprinting
Audio fingerprinting is best known for its ability to identify an unknown audio recording using a compact content-based signature, the so-called fingerprint [21]. It does this by converting the audio features into hash codes, aiming to uniquely identify an audio recording. The advantage of a fingerprint is that it reduces storage costs, since a fingerprint is relatively small. Moreover, perceptual irrelevancies have been removed from the fingerprint, resulting in efficient comparison and searching.
3. METHOD
The aim of this paper is to provide a technique that yields good retrieval quality together with computational efficiency. In this work, compact binary codes are learned for fingerprint indexing with an unsupervised deep network in such a way that the similarity between samples is preserved. Once a short audio sample is submitted to our content-based audio retrieval system, the system performs a database lookup for the matching track and then returns the song ID from which the query was taken. As shown in Figure 3, the system is designed with three steps: (1) fingerprint feature extraction, (2) unsupervised similarity-preserving hashing, and (3) sequence matching.
Figure 3. The construction of the proposed method for our content-based audio retrieval system: fingerprint features [f1, Δf, Δt] are extracted from the reference and query audio, mapped to 16-bit hash codes by the deep learning to hash network (layers of 20, 19, 18, 16, and 20 nodes), stored in the items database, and resolved by sequence matching to return the song ID
3.1. Fingerprint feature extraction
Before fingerprint feature extraction is performed, the audio signal is converted into a common format for analysis. Next, the time-series audio signal is converted into the time-frequency domain, from which more meaningful information can be extracted. Each aspect is detailed below.
3.1.1. Preprocessing and transform
In this paper, the fingerprint extraction presented in [3] is applied. First, we convert the input audio to a mono signal and downsample it from the standard digital audio rate of 44.1 kHz to 8 kHz, making the data easier to handle, reducing the database size, and increasing the speed of the algorithm. The audio signal is then converted into a time-frequency representation. We perform a short-time Fourier transform (STFT) with a window size of 64 ms for good spectral resolution [22] and a hop size of 32 ms. Figure 4 shows the resulting time-frequency graph, the so-called spectrogram: the horizontal axis is time, the vertical axis is frequency, and the third dimension is intensity. Each point on the graph represents the intensity of a given frequency at a specific time.
Figure 4. Spectrogram with peak intensities
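A sketch of this preprocessing and transform using SciPy; the 80/441 resampling ratio realizes 44.1 kHz → 8 kHz, and the 512-sample window with a 256-sample hop corresponds to 64 ms and 32 ms at 8 kHz. The function and parameter names are illustrative.

import numpy as np
from scipy.signal import resample_poly, stft

def preprocess(audio_stereo_441k):
    # Stereo 44.1 kHz -> mono 8 kHz magnitude spectrogram
    mono = audio_stereo_441k.mean(axis=1)       # average the channels to mono
    x = resample_poly(mono, up=80, down=441)    # 44100 * 80/441 = 8000 Hz
    f, t, Z = stft(x, fs=8000, nperseg=512, noverlap=256)  # 64 ms window, 32 ms hop
    return f, t, np.abs(Z)                       # frequencies, times, magnitudes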
3.1.2. Feature extraction
After converting the signal into the time-frequency domain, the features are extracted from the spectrum. Due to their robustness to noise and distortion, the amplitude peaks in each frame are selected as candidate points. Each candidate point is paired with adjacent peaks. The constellation map of paired points with its coordinate list is shown in Figure 5. In this work, each candidate point is paired within a target region of 31 frequency bins and 63 time frames, and only the 3 peaks closest in time are selected. Figure 6 shows the combinatorial association of a pair of two points, which is called a 'landmark'. Each pair consists of four components: the starting frequency $f_1$, the starting time $t_1$, the end frequency $f_2$, and the end time $t_2$.
Figure 5. A constellation map of paired points

Figure 6. The combinatorial association of a pair of two points: a candidate point (t1, f1) is paired with a point (t2, f2) inside a target region of 31 frequency bins and 63 time frames, giving Δf = f2 − f1, Δt = t2 − t1, and feature = [f1, Δf, Δt]
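A sketch of the peak picking and pairing described above; the limits of 31 frequency bins, 63 time frames, and 3 pairs per candidate point follow the text, while the 9x9 local-maximum neighborhood is an assumption.

import numpy as np
from scipy.ndimage import maximum_filter

def landmarks(S, n_pairs=3, df_max=31, dt_max=63):
    # Extract [f1, df, dt] landmark features from a magnitude spectrogram S (freq x time)
    peaks = (S == maximum_filter(S, size=(9, 9))) & (S > 0)   # assumed neighborhood size
    f_idx, t_idx = np.nonzero(peaks)
    order = np.argsort(t_idx)                                  # scan candidate points in time order
    f_idx, t_idx = f_idx[order], t_idx[order]
    feats = []
    for i in range(len(t_idx)):
        paired = 0
        for j in range(i + 1, len(t_idx)):                     # nearest peaks in time first
            dt, df = t_idx[j] - t_idx[i], f_idx[j] - f_idx[i]
            if dt > dt_max:
                break
            if dt > 0 and abs(df) <= df_max:
                feats.append((f_idx[i], df, dt, t_idx[i]))     # feature plus anchor time t1
                paired += 1
                if paired == n_pairs:
                    break
    return feats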
3.1.3. Audio fingerprint
For the landmark mentioned above, the audio fingerprint can be defined as follows:

$\mathrm{Fingerprint} = [f_1, \Delta f, \Delta t]$  (5)

where the frequency difference is $\Delta f = f_2 - f_1$ and the time difference between the two points is $\Delta t = t_2 - t_1$. The fingerprint is also associated with the offset time from the beginning of the audio file to the starting time $t_1$.
The fingerprint feature $[f_1, \Delta f, \Delta t]$ is used to generate the hash code in [3]. The hash model can be defined as shown in (6):

$f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$  (6)

where the fingerprint hash is composed of an 8-bit frequency $f_1$, a 6-bit frequency difference $\Delta f$, and a 6-bit time difference $\Delta t$. Figure 7 shows an example of a 20-bit hash address calculated from (6).
Figure 7. An example of a 20-bit hash address: the feature [f1, Δf, Δt] = [216, 18, 36] maps to the hash address 216 × 2^12 + 18 × 2^6 + 36 = 885924 (binary 11011000 010010 100100)
A 16-bit fingerprint hash is composed of a 6-bit frequency $f_1$, a 5-bit frequency difference $\Delta f$, and a 5-bit time difference $\Delta t$. The hash model can be defined as shown in (7):

$f_1 \times 2^{10} + \Delta f \times 2^{5} + \Delta t$  (7)
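The packings in (6) and (7) amount to integer bit operations; in this sketch, masking each component to its field width is added as an assumption to keep the components inside their bit budgets.

def hash20(f1, df, dt):
    # 20-bit hash of (6): 8-bit f1, 6-bit df, 6-bit dt
    return ((f1 & 0xFF) << 12) | ((df & 0x3F) << 6) | (dt & 0x3F)

def hash16(f1, df, dt):
    # 16-bit hash of (7): 6-bit f1, 5-bit df, 5-bit dt
    return ((f1 & 0x3F) << 10) | ((df & 0x1F) << 5) | (dt & 0x1F)

assert hash20(216, 18, 36) == 885924     # reproduces the example of Figure 7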
After the hash code is calculated, the system then uses this code as an index for searching in
the database. An exact matching algorithm is applied in [3]. Unlike the Shazam algorithm, we develop a deep
neural network with multiple hierarchical layers of nonlinear and linear transformations to learn compact
codes from these fingerprint features such that the similarity between samples is preserved. The details are
described further in the next section.
3.2. Unsupervised similarity-preserving hashing (USH)
In this paper, the hash transformations are created by an unsupervised deep neural network. As shown in Figure 8, there are five layers in our deep network: the input layer consists of 20 nodes for the input $x_i$; the three hidden layers consist of 19, 18, and 16 nodes, respectively; and the output layer consists of 20 nodes for the reconstruction $\hat{x}_i$.
Figure 8. Our proposed unsupervised similarity-preserving hashing network (USH): five layers of 20, 19, 18, 16, and 20 nodes; layer 1 passes the inputs through, layers 2 and 3 use the sigmoid activation, layer 4 uses the hyperbolic tangent, and layer 5 is linear
Our deep network is trained so that the output of the fourth layer can be used as the binary hash codes. In the network design, each node is composed of one input summation function and one output transformation function. The function $f(\cdot)$ combines the information arriving on the links from other nodes, as shown in (8):

$\mathrm{node}_{in} = f(x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}; w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n})$  (8)

where $x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}$ are the inputs to the node, $w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n}$ are the associated weights, $l$ indicates the layer number, and $n$ is the number of input nodes. The output activation function $a(f(\cdot))$ is shown in (9):

$\mathrm{node}_{out} = a_{(l)}(\mathrm{node}_{in}) = a_{(l)}(f(\cdot))$  (9)
Let $H_i$ be the sum of products of the inputs $x_i$ and weights $w_i$, $H_i = \sum w_i x_i$. The functions of the nodes from layer 1 to layer 5 of our proposed network are defined as follows:

Layer 1: In this layer, the nodes only convey the inputs to the nodes of the next layer. The functions of the $i$th node are:

$f_{(1),i} = x_{(1),i}$ and $a_{(1),i} = f_{(1),i}$  (10)

Layer 2: In this layer, the sigmoid function is used as the activation function. Let $x_{(2),i} = a_{(1),i}$ and $H_{(2),i} = \sum w_{(2),i} x_{(2),i}$. The functions of the $i$th node are defined as:

$f_{(2),i} = \frac{1}{1 + e^{-H_{(2),i}}}$ and $a_{(2),i} = f_{(2),i}$  (11)

Layer 3: In this layer, the sigmoid function is again used as the activation function. Let $x_{(3),i} = a_{(2),i}$ and $H_{(3),i} = \sum w_{(3),i} x_{(3),i}$. The functions of the $i$th node are defined as:

$f_{(3),i} = \frac{1}{1 + e^{-H_{(3),i}}}$ and $a_{(3),i} = f_{(3),i}$  (12)

Layer 4: The output of each node in this layer will be used as the binary codes. During training, these codes are used to reconstruct the input data at the output layer. The hyperbolic tangent function is used as the activation function in this layer. Let $x_{(4),i} = a_{(3),i}$ and $H_{(4),i} = \sum w_{(4),i} x_{(4),i}$. The functions of the $i$th node are defined as:

$f_{(4),i} = \frac{e^{H_{(4),i}} - e^{-H_{(4),i}}}{e^{H_{(4),i}} + e^{-H_{(4),i}}}$ and $a_{(4),i} = f_{(4),i}$  (13)

Layer 5: This layer is the output or reconstruction layer. To preserve the similarity between samples, the target outputs are set equal to the inputs of layer 1. Let $x_{(5),i} = a_{(4),i}$ and $H_{(5),i} = \sum w_{(5),i} x_{(5),i}$. The functions of the $i$th output node are defined as:

$f_{(5),i} = H_{(5),i}$ and $a_{(5),i} = f_{(5),i}$  (14)
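A sketch of a forward pass through the five layers with the stated activations (identity, sigmoid, sigmoid, tanh, linear); the 20-19-18-16-20 shapes follow Figure 8, while the random weight initialization is only for illustration.

import numpy as np

def sigmoid(H):
    return 1.0 / (1.0 + np.exp(-H))

sizes = [20, 19, 18, 16, 20]                       # layers 1..5
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]

def ush_forward(x):
    # Returns the 16-node code-layer output and the 20-node reconstruction
    a = x                                          # layer 1: identity, eq. (10)
    a = sigmoid(Ws[0] @ a + bs[0])                 # layer 2, eq. (11)
    a = sigmoid(Ws[1] @ a + bs[1])                 # layer 3, eq. (12)
    code = np.tanh(Ws[2] @ a + bs[2])              # layer 4: code layer, eq. (13)
    recon = Ws[3] @ code + bs[3]                   # layer 5: linear output, eq. (14)
    return code, recon

code, recon = ush_forward(rng.normal(size=(20, 1)))
y = np.where(code >= 0, 1, -1)                     # binarized 16-bit hash code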
To achieve efficient binary codes, we include constraints in the objective function so that the codes have four properties: (1) belonging to $\{1, -1\}$, (2) similarity preserving, (3) independence, and (4) balance. In this paper, the method presented in UH-BDNN [23] is applied to optimize the objective function, which is defined as follows:

$\min_{W,b} Loss = \frac{1}{2N}\|X - (W^{(L-1)}Y + b^{(L-1)}\mathbf{1}_{1 \times N})\|^2 + \frac{\lambda_1}{2}\sum_{l=1}^{L-1}\|W^{(l)}\|^2 + \frac{\lambda_2}{2N}\|H^{(L-1)} - Y\|^2 + \frac{\lambda_3}{2}\left\|\frac{1}{N}H^{(L-1)}(H^{(L-1)})^T - I\right\|^2 + \frac{\lambda_4}{2N}\|H^{(L-1)}\mathbf{1}_{N \times 1}\|^2$  (15)

$\text{s.t. } Y \in \{1, -1\}^{K \times N}$  (16)
where $X \in \mathbb{R}^{D \times N}$ is a set of $N$ training samples of dimension $D$, $Y \in \{1, -1\}^{K \times N}$ is the output binary code of $X$, $K$ is the number of bits, $L$ is the number of layers, $W^{(l)}$ is the weight matrix between layer $l+1$ and layer $l$, $b^{(l)}$ is the bias vector for the nodes in layer $l+1$, $H^{(l)} = f^{(l)}(W^{(l-1)}H^{(l-1)} + b^{(l-1)}\mathbf{1}_{1 \times N})$ is the output of layer $l$ with $H^{(1)} = X$, $f^{(l)}$ is the activation function of layer $l$, and $\lambda_1$-$\lambda_4$ are the parameters of the objective function.
The first term of (15) ensures that the binary code allows a good reconstruction of $X$. The second term is a weight regularization that encourages the network to keep the weights small in order to reduce overfitting. The third term measures the violation of the equality constraint between $H^{(L-1)}$ and $Y$. The fourth term encourages independence of the binary codes, and the fifth term encourages their balance. The constraint (16) ensures that each bit of the binary codes belongs to $\{1, -1\}$. After the codes are produced by the deep network, they are used as the search index in our content-based audio retrieval system. The song ID, $t_1$, $f_1$, $\Delta f$, and $\Delta t$ are stored at their hash address in the database. Table 1 shows the representation of the information data.
Table 1. Representation of information data
Method       Index                          Information data
USH          16-bit hash address            Song ID, t1, f1, Δf, Δt
Shazam [3]   16-bit / 20-bit hash address   Song ID, t1
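A sketch of evaluating the objective (15) in NumPy for given weights and codes; the argument layout, with H the code-layer output H^(L-1) and W_rec, b_rec the reconstruction weights W^(L-1), b^(L-1), is an assumption for illustration.

import numpy as np

def ush_loss(X, Y, H, W_list, W_rec, b_rec, lam=(1e-3, 1e-2, 1e-2, 1e-2)):
    # X: (D, N) data; Y, H: (K, N) with Y in {1, -1}; lam: lambda_1..lambda_4
    l1, l2, l3, l4 = lam
    N, K = X.shape[1], H.shape[0]
    recon = X - (W_rec @ Y + b_rec)                          # b_rec broadcasts over N columns
    loss = np.sum(recon ** 2) / (2 * N)                      # reconstruction term
    loss += l1 / 2 * sum(np.sum(W ** 2) for W in W_list)     # weight regularization
    loss += l2 / (2 * N) * np.sum((H - Y) ** 2)              # H ~ Y constraint violation
    loss += l3 / 2 * np.sum((H @ H.T / N - np.eye(K)) ** 2)  # independence term
    loss += l4 / (2 * N) * np.sum((H @ np.ones((N, 1))) ** 2)  # balance term
    return loss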
3.3. Sequence matching
In the query step, a sequence of query features is converted into a set of compact hash codes used for searching the inverse lookup database. Let $Q$ represent a sequence of query features, $Q = \{q_1, q_2, \ldots, q_M\}$, where $q_m$ is the query at order $m$, $m = 1, 2, \ldots, M$, and $M$ is the total number of queries in the sequence. The learned hash function $H: q_m \to h(q_m)$ is used to map the query features to binary hash codes. We can map $Q = \{q_m\}_{m=1}^{M}$ to the corresponding binary codes as follows:

$Y = H(Q) = \{h(q_1), h(q_2), \ldots, h(q_M)\}$  (17)

where $Y \in \{1, -1\}^{K \times M}$ is the hash codes of $Q$ and $K$ is the number of bits. After learning the deep network, we obtain a set of items indexed by each hash address. Let $s_m = \{x_{m1}, x_{m2}, \ldots, x_{mn_m}\}$ be the set of items retrieved for $q_m$, where $x_{mi} \in \mathbb{R}^{5 \times 1}$ is an information vector stored in the database and $n_m$ is the number of items for $q_m$. Given $S = \{s_m\}_{m=1}^{M}$, the set of all $s_m$ for $m = 1, 2, \ldots, M$, the sequence matching process is shown in Figure 9.
Figure 9. Our proposed sequence matching: a sequence of queries Q = {q1, q2, q3} is mapped by the learned hash function to hash addresses in the reference inverse lookup database; the matching items (information data [Song ID, t1, f1, Δf, Δt]) are collected into S, and the song ID is returned from the items with the minimum distance and the most frequently occurring similar relative time offsets
As can be seen in Figure 9, assume that $Q = \{q_1, q_2, q_3\}$; given the binary codes $H(Q)$, we obtain the sets $s_1 = \{x_{11}, x_{12}, x_{13}, x_{14}\}$, $s_2 = \{x_{21}, x_{22}\}$, $s_3 = \{x_{31}, x_{32}, x_{33}\}$, and $S = \{x_{11}, x_{12}, x_{13}, x_{14}, x_{21}, x_{22}, x_{31}, x_{32}, x_{33}\}$. The one-nearest-neighbor search $NN(q_m)$ for the query item at order $m$ within $s_m$ is defined as follows:

$NN(q_m) = \arg\min_{x_{mi} \in s_m} \|x_{mi} - q_m\|_2$  (18)

where $\|x_{mi} - q_m\|_2$ is the $L_2$-norm between $x_{mi}$ and the query $q_m$, and $R_Q = \{NN(q_m)\}_{m=1}^{M}$ is the candidate item set of $Q$. We also apply a time offset constraint to improve the accuracy of the sequence matching. The constraint requires the absolute difference $|T_{x_m} - T_{q_m}|$ to be consistent across the sequence:

$|T_{x_1} - T_{q_1}| = \cdots = |T_{x_m} - T_{q_m}|$  (19)

where $T_{x_m}$ and $T_{q_m}$ are the offset times of the reference file $x_m$ and the query $q_m$, respectively. In other words, a sequence of candidate items should occur with the same absolute time difference across the sequence. Our proposed method follows the procedure below.
In summary, the proposed audio retrieval algorithm is based on two parts: the similarity (minimum distance) between the audio query and the songs in the reference database, and the absolute difference among the time sequences.
Algorithm: Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks

Input:  M = {m_i}_{i=1}^{N}, a reference set of N songs; Q, a query set
Output: SID, the song ID returned by the proposed algorithm

Step 1: Extract fingerprint features X = {x_i}_{i=1}^{N} ∈ R^{D×N} from the reference dataset.
Step 2: Learn the hash, where x_i ∈ R^D is an input vector and the objective function is defined
        as in (15)-(16). The learned function h(q) is applied to each audio fingerprint q in step 3.
Step 3: Sequence matching
  1. Divide the query sample into M fingerprints Q = {q_m}_{m=1}^{M}.
  2. s_i = {}, A = {}, S = {}, R_Q = {}
     for i = 1, 2, ..., M do
         index_i = h(q_i)
         s_i = all items in A(index_i)
         S = S ∪ s_i
     end
     for i = 1, 2, ..., M do
         Aux = maxValue
         for j = 1, 2, ..., sizeOf(s_i) do
             if ||x_j − q_i||_2 < Aux then
                 Aux = ||x_j − q_i||_2
                 absTime = |T_{x_j} − T_{q_i}|
                 SID = SID_{x_j}
             end
         end
         R_Q = R_Q ∪ {SID, absTime}
     end
  3. Find the song ID with the maximum frequency among the entries of R_Q that share the same
     absolute difference of time offsets.
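A sketch of this sequence matching in plain Python; the database layout (a dict from hash address to lists of (song_id, t1, f1, df, dt) tuples) and the joint vote over (song ID, time offset) pairs are illustrative assumptions consistent with the algorithm above.

import numpy as np
from collections import Counter

def match_sequence(queries, database, h):
    # queries: list of (feat, t_q) with feat = [f1, df, dt]; h: learned hash function
    votes = Counter()
    for feat, t_q in queries:
        best, best_dist = None, float("inf")
        for song_id, t1, f1, df, dt in database.get(h(feat), []):  # bucket s_i
            dist = np.linalg.norm(np.array([f1, df, dt], float) - np.asarray(feat, float))
            if dist < best_dist:                  # 1-NN within the bucket, eq. (18)
                best_dist = dist
                best = (song_id, abs(t1 - t_q))   # candidate with its absolute time offset
        if best is not None:
            votes[best] += 1                      # identical (song, offset) pairs accumulate, eq. (19)
    if not votes:
        return None
    (song_id, _offset), _count = votes.most_common(1)[0]
    return song_id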
4. EXPERIMENTAL AND PERFORMANCE ANALYSIS
4.1. Database
The performance of our proposed USH method is evaluated on the Extended Ballroom dataset, freely available in [24, 25]. The dataset consists of 4,180 musical excerpts of 13 genres with a length of 30 seconds each. The audio quality of the data is 44.1 kHz, 192 kbps, stereo, MP3 format. In this work, the audio signal is downsampled to 8 kHz to make the data easier to handle, as previously mentioned. The training set (also used as the reference database for retrieval) is composed of 3,000 tracks from 8 genres, the same rhythm classes as in our previous works [26, 27]. A set of 1,000 audio queries with a length of 10 seconds each is randomly selected from those 3,000 tracks. Another set of 200 audio queries comes from audio files that do not appear in the database, in order to analyze the false positive rate. Each audio sample is represented by a 20-dimensional feature vector extracted by the fingerprint algorithm. Table 2 shows the number of samples in the database and the query set.
Table 2. Audio samples in the database and query set
Set        Number of tracks   Length of segment (s)   Number of samples
Database   3,000              30                      441,184
Query      1,200              10                      67,785
4.2. Performance evaluation
4.2.1. Effectiveness of retrieval
On a total of 1,200 audio queries, the retrieval results obtained with our proposed USH method and with the state-of-the-art data-independent method, the Shazam algorithm, are shown in Table 3. A false negative (FN) refers to the incorrect identification that the query audio does not exist in the database when it does; a true positive (TP) refers to the correct identification of the audio recording from the query; a false positive (FP) refers to the incorrect identification of a wrong recording when the correct recording does not exist in the database; and a true negative (TN) refers to the correct identification that no audio recording matches the query. According to the experimental results, we obtain a higher accuracy for the proposed USH (88.92%) than for the state-of-the-art data-independent method, the Shazam algorithm (71.67% for the 16-bit hash code, 87.42% for the 20-bit hash code). Figure 10 shows the retrieval accuracy comparison between the two methods.
Table 3. Retrieval results comparison between USH and the state-of-the-art data-independent method
Method          FN    TP    FP    TN
USH 16-bit      114   886   19    181
Shazam 16-bit   290   710   50    150
Shazam 20-bit   132   868   19    181
Figure 10. The retrieval accuracy of the proposed USH and the Shazam algorithm
The effectiveness of the USH is evaluated through the experiments and compared with the state-of-the-art data-independent method in terms of precision, recall, F1 score, and false positive rate [28], as follows:

$Precision = \frac{TP}{TP + FP} \times 100$  (20)

$Recall = \frac{TP}{TP + FN} \times 100$  (21)

$F1\,score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$  (22)

$False\ Positive\ Rate = \frac{FP}{FP + TN} \times 100$  (23)
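These four measures follow directly from the confusion counts; the sketch below reproduces the USH row of Table 4 from the counts in Table 3.

def retrieval_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn) * 100
    return precision, recall, f1, fpr

print(retrieval_metrics(tp=886, fp=19, fn=114, tn=181))
# -> approximately (97.90, 88.60, 93.02, 9.50), matching Table 4 for USH 16-bit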
As can be seen in Table 4, we obtain higher precision and recall values for the proposed USH than for the state-of-the-art data-independent method with both 16-bit and 20-bit hash codes. The F1 score (in the fourth column) shows the overall effectiveness of the two methods. Furthermore, the USH has a significantly lower percentage of false positives (9.50%) than the state-of-the-art data-independent method (25.00%) for the same 16-bit hash code. This shows the superior performance of the USH on short codes.
Table 4. Effectiveness comparison between USH and the state-of-the-art data-independent method
Method          Precision   Recall   F1 score   % False positive
USH 16-bit      97.90       88.60    93.02      9.50
Shazam 16-bit   93.42       71.00    80.68      25.00
Shazam 20-bit   97.86       86.80    92.00      9.50
4.2.2. Storage cost
As a result of our proposed USH method, the 20-bit fingerprint features can be mapped into 16-bit binary codes. Hence, the hash index is reduced by a factor of 16 (2^20 versus 2^16 possible hash addresses), significantly reducing the database size and yielding higher search performance.
4.3. Discussion
The performance of the proposed USH for large audio database retrieval is evaluated through the experiments and compared with the state-of-the-art data-independent hashing method, the Shazam algorithm, on a test set of 3,000 audio recordings. The experimental results support the effectiveness of the USH, with high precision and recall values of 97.90% and 88.60%, respectively. These satisfactory results arise because the hash codes produced by our proposed method have the similarity-preserving property, i.e., similar items are mapped to the same hash code while dissimilar items are mapped to different ones. The data-independent methods do not take this property into account.
The Shazam algorithm has a higher percentage of false positives (25.00%) than the USH (9.50%) for the same 16-bit hash code. This shows that the Shazam algorithm is more likely to give incorrect identifications for short codes and therefore has inferior performance in audio retrieval. Furthermore, if the database size increases substantially in the future, it is likely that the Shazam algorithm would produce a significant number of false positive matches.
Let us consider the collection S of the USH and of the Shazam algorithm, which affects the accuracy of audio retrieval. For the Shazam algorithm, the collection S consists of the data items for which the search algorithm finds only exact matches to the search queries, regardless of similarity preservation, and this may result in the loss of relevant data. The collection S of our proposed USH method, by contrast, consists of candidate data items for which the search algorithm focuses on the similarity between the search queries and the items in the database. As shown in Table 5, with the collection S of the USH, Song ID 48 is correctly identified by the two data items with the smallest distance (distance = 1) at the same time offset. With the Shazam algorithm, the relevant song cannot be retrieved.
Table 5. Example of the collection S of the proposed USH and the Shazam algorithm

Method          Song ID   Time offset   Distance   Number of item(s)
USH 16-bit      150       128           16         2
                1041      239           28         2
                936       -232          32         2
                48*       152           1          2
                306       228           48         2
                2453      516           24         2
                690       3720          32         1
                690       2726          32         1
Shazam 16-bit   2732      164           -          2
                847       19            -          2
                684       2871          -          1
                674       4475          -          1
                675       2584          -          1
                676       1564          -          1
                676       4739          -          1
                677       1820          -          1
Shazam 20-bit   1174      52            -          1
                1176      557           -          1
                1230      218           -          1
                706       363           -          1
                340       395           -          1
                48        153           -          1
                93        448           -          1
                139       453           -          1

* Indicates that the system correctly identified the audio recording
The major factors behind this superiority are 1) the similarity-preserving hash codes produced by our proposed USH method, and 2) the proposed audio retrieval algorithm, which combines two criteria: the $L_2$-norm and the absolute time offset difference. These factors increase the ability to identify candidate items according to the similarity between the audio sample and the songs in the reference database, and this significantly improves the retrieval performance.
5. CONCLUSION
In this paper, an unsupervised similarity-preserving hashing (USH) method for content-based audio retrieval is proposed. We develop a deep network with multiple hierarchical layers of nonlinear and linear transformations to learn compact hash codes in which the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. The experimental results on the Extended Ballroom dataset show the superiority of our proposed method over the state-of-the-art data-independent method. Future work should focus on extending USH to supervised hashing by leveraging semantic labels to enhance the retrieval performance.
REFERENCES
[1] P. Grosche, M. Müller, and J. Serrà, “Audio Content-Based Music Retrieval,” Multimodal Music Processing.
Dagstuhl Follow-Ups, vol. 3, pp. 157-174, 2012.
[2] J. Haitsma, and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,” Journal
of New Music Research, vol. 32, no. 2, pp. 211-222, 2003.
[3] A. L. Wang, “An Industrial-Strength Audio Search Algorithm,” 4th International Conference on Music
Information Retrieval (ISMIR 2003), pp. 7-13, 2003.
[4] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, “Time-Frequency Ratio Hashing for Content-Based
Audio Retrieval,” 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi,
pp. 205-210, 2017.
[5] A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” 25th International
Conference on Very Large Data Bases (VLDB’99), pp. 518-529, 1999.
[6] B. Kulis, P. Jain, and K. Grauman, “Fast Similarity Search for Learned Metrics,” In: IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.
[7] M. Raginsky, and S. Lazebnik, “Locality-Sensitive Binary Codes from Shift-Invariant Kernels,” 23rd Annual
Conference on Neural Information Processing Systems (NIPS’09), pp. 1509-1517, 2009.
[8] Y. Zheng, J. Zhu, W. Fang, and L.-H. Chi, “Deep Learning Hash for Wireless Multimedia Image Content Security,” Security and Communication Networks, vol. 2018, pp. 1-13, 2018.
[9] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, “Learning to Hash for Indexing Big Data-A Survey,” Proceedings of
the IEEE, vol. 104, no. 1, pp. 34-57, 2016.
[10] Y. Weiss, A. Torralba, and R. Fergus, “Spectral Hashing,” 21st International Conference on Neural Information
Processing Systems (NIPS’08), pp. 1753-1760, 2008.
[11] B. Kulis, and K. Grauman, “Kernelized Locality-Sensitive Hashing for Scalable Image Search,” 2009 IEEE 12th
International Conference on Computer Vision (ICCV), Kyoto, pp. 2130-2137, 2009.
[12] Y. Gong, and S. Lazebnik, “Iterative Quantization: A Procrustean Approach to Learning Binary Codes,” IEEE
Conference on Computer Vision and Pattern recognition (CVPR 2011), Providence, RI, pp. 817-824, 2011.
[13] B. Kulis, and T. Darrell, “Learning to Hash with Binary Reconstructive Embeddings,” 22nd International
Conference on Neural Information Processing Systems (NIPS’09), pp. 1042-1050, 2009.
[14] W. Liu, J. Wang, R. Ji, Y. G. Jiang, and S. F. Chang, “Supervised Hashing with Kernels,” 2012 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, pp. 2074-2081, 2012.
[15] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming Distance Metric Learning,” 25th International
Conference on Neural Information Processing Systems (NIPS’12), pp. 1061-1069, 2012.
[16] J. Wang, S. Kumar, and S. F. Chang, “Semi-Supervised Hashing for large-Scale Search,” In IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393-2406, 2012.
[17] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised Discrete Hashing,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 37-45, 2015.
[18] X. Bai, H. Yang, J. Zhou, P. Ren, and J. Cheng, “Data-dependent Hashing Based on p-Stable Distribution,” IEEE
Transactions on Image Processing, vol. 23, no. 12, pp. 5033-5046, 2014.
[19] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A Survey on Learning to Hash,” In IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 769-790, 2018.
[20] J. He, S. F. Chang, R. Radhakrishnan, and C. Bauer, “Compact hashing with joint optimization of search accuracy and time,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 753-760, 2011.
[21] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A Review of Audio Fingerprinting,” Journal of VLSI Signal
Processing, vol. 41, pp. 271-284, 2005.
[22] D. Ellis, “Robust Landmark-Based Audio Fingerprinting,” 2015. [Online]. Available: https://labrosa.ee.columbia.edu/matlab/fingerprint/.
[23] T. T. Do, A. D. Doan, and N. M. Cheung, “Learning to Hash with Binary Deep Neural Network,” 14th European
Conference on Computer Vision (ECCV 2016), pp. 219-234, 2016.
[24] U. Marchand, and G. Peeters. “Scale and shift invariant time/frequency representation using auditory statistics:
Application to rhythm description,” 2016 IEEE 26th International Workshop on Machine Learning for Signal
Processing (MLSP), Vietri sul Mare, pp. 1-6, 2016.
[25] U. Marchand, and G. Peeters, “The Extended Ballroom Dataset,” 17th International Society for Music Information
Retrieval Conference (ISMIR 2016) Late-Breaking Session, New-York, USA, pp. 1-3, 2016.
[26] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, “Unsupervised Learning Hash for Content-Based
Audio Retrieval Using Deep Neural Networks,” 2019 11th International Conference on Knowledge and Smart
Technology (KST), Phuket, Thailand, pp. 99-104, 2019.
[27] P. Panyapanuwat, and S. Kamonsantiroj, “Performance Comparison of Unsupervised Deep Hashing with Data-
independent Hashing for Content-Based Audio Retrieval,” 2019 2nd International Conference on Electronics,
Communications and Control Engineering, pp. 16-20, 2019.
[28] C. Manning, P. Raghavan, and H. Schütze, “An Introduction to Information Retrieval,” Cambridge University
Press, 2009.
BIOGRAPHIES OF AUTHORS
Petcharat Panyapanuwat holds a bachelor’s degree in mathematics and a master’s degree in software engineering. She is currently a Ph.D. candidate in the Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand. Her current research interest focuses on music information retrieval.
Suwatchai Kamonsantiroj is currently a lecturer at Department of Computer and Information
Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand.
He holds a bachelor’s degree in mechanical engineering and a master’s degree in information
technology management. He also earned his doctoral degree in computer engineering from
Kasetsart University, Thailand, graduating in 2008. His current research interests include neural
network, time series analysis, and artificial intelligence.
Luepol Pipanmaekaporn is currently a lecturer at Department of Computer and Information
Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand.
He holds both a bachelor’s and a master’s degree in computer science. He also earned his
doctoral degree in computer science from Queensland University of Technology, Australia,
graduating in 2013. His current research interests include information retrieval, web mining, and
data mining.

More Related Content

PPTX
Image to text Converter
PDF
Text Extraction from Image using Python
PDF
Image-Based Literal Node Matching for Linked Data Integration
PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
Pioneering VDT Image Compression using Block Coding
PDF
Different Steganography Methods and Performance Analysis
PDF
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
PDF
AN INNOVATIVE IDEA FOR PUBLIC KEY METHOD OF STEGANOGRAPHY
Image to text Converter
Text Extraction from Image using Python
Image-Based Literal Node Matching for Linked Data Integration
International Journal of Engineering and Science Invention (IJESI)
Pioneering VDT Image Compression using Block Coding
Different Steganography Methods and Performance Analysis
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
AN INNOVATIVE IDEA FOR PUBLIC KEY METHOD OF STEGANOGRAPHY

What's hot (20)

PDF
Networkx tutorial
PPTX
Text extraction from images
PDF
A bidirectional text transcription of braille for odia, hindi, telugu and eng...
PPTX
A Fast and Dirty Intro to NetworkX (and D3)
PDF
Optical Character Recognition
PDF
Quality Measurements of Lossy Image Steganography Based on H-AMBTC Technique ...
PPTX
BrailleOCR: An Open Source Document to Braille Converter Application
PDF
Improvement in Traditional Set Partitioning in Hierarchical Trees (SPIHT) Alg...
PDF
An approach to hide data in video using steganography
PDF
Enhancement of DES Algorithm with Multi State Logic
PDF
Ijcnc050213
PPTX
Texture features based text extraction from images using DWT and K-means clus...
PDF
Data structure-question-bank
PDF
Self-Directing Text Detection and Removal from Images with Smoothing
PDF
Is3314841490
PDF
gilbert_iccv11_paper
PDF
Ieee a secure algorithm for image based information hiding with one-dimension...
PDF
IRJET- Image Captioning using Multimodal Embedding
PDF
Dictionary based Image Compression via Sparse Representation
PDF
Relaxing global-as-view in mediated data integration from linked data
Networkx tutorial
Text extraction from images
A bidirectional text transcription of braille for odia, hindi, telugu and eng...
A Fast and Dirty Intro to NetworkX (and D3)
Optical Character Recognition
Quality Measurements of Lossy Image Steganography Based on H-AMBTC Technique ...
BrailleOCR: An Open Source Document to Braille Converter Application
Improvement in Traditional Set Partitioning in Hierarchical Trees (SPIHT) Alg...
An approach to hide data in video using steganography
Enhancement of DES Algorithm with Multi State Logic
Ijcnc050213
Texture features based text extraction from images using DWT and K-means clus...
Data structure-question-bank
Self-Directing Text Detection and Removal from Images with Smoothing
Is3314841490
gilbert_iccv11_paper
Ieee a secure algorithm for image based information hiding with one-dimension...
IRJET- Image Captioning using Multimodal Embedding
Dictionary based Image Compression via Sparse Representation
Relaxing global-as-view in mediated data integration from linked data
Ad

Similar to Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks (20)

PPT
20140327 - Hashing Object Embedding
PDF
Supervised Quantization for Similarity Search (camera-ready)
PDF
Hash Coding
PDF
Multiview Alignment Hashing for Efficient Image Search
PPT
20140702 xu jiaming hashinglearning - lite
PDF
large_scale_search.pdf
PDF
ENTROPY OPTIMIZED FEATURE-BASED BAG-OF-WORDS REPRESENTATION FOR INFORMATION R...
PDF
A Hybrid Procreative –Discriminative Based Hashing Method
PDF
5 efficient-matching.ppt
PDF
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
PDF
Finding similar items in high dimensional spaces locality sensitive hashing
PDF
Bytewise Approximate Match: Theory, Algorithms and Applications
PDF
Chromatic Sparse Learning
PDF
[241]large scale search with polysemous codes
PPT
PDF
Probabilistic data structures. Part 4. Similarity
PDF
Binary Similarity : Theory, Algorithms and Tool Evaluation
PDF
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
PPTX
3 - Finding similar items
PPT
similarity1 (6).ppt
20140327 - Hashing Object Embedding
Supervised Quantization for Similarity Search (camera-ready)
Hash Coding
Multiview Alignment Hashing for Efficient Image Search
20140702 xu jiaming hashinglearning - lite
large_scale_search.pdf
ENTROPY OPTIMIZED FEATURE-BASED BAG-OF-WORDS REPRESENTATION FOR INFORMATION R...
A Hybrid Procreative –Discriminative Based Hashing Method
5 efficient-matching.ppt
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Finding similar items in high dimensional spaces locality sensitive hashing
Bytewise Approximate Match: Theory, Algorithms and Applications
Chromatic Sparse Learning
[241]large scale search with polysemous codes
Probabilistic data structures. Part 4. Similarity
Binary Similarity : Theory, Algorithms and Tool Evaluation
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
3 - Finding similar items
similarity1 (6).ppt
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...

Recently uploaded (20)

PPTX
additive manufacturing of ss316l using mig welding
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Structs to JSON How Go Powers REST APIs.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Geodesy 1.pptx...............................................
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
OOP with Java - Java Introduction (Basics)
additive manufacturing of ss316l using mig welding
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Lesson 3_Tessellation.pptx finite Mathematics
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Structs to JSON How Go Powers REST APIs.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
573137875-Attendance-Management-System-original
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Embodied AI: Ushering in the Next Era of Intelligent Systems
CH1 Production IntroductoryConcepts.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
UNIT 4 Total Quality Management .pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Geodesy 1.pptx...............................................
Arduino robotics embedded978-1-4302-3184-4.pdf
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
OOP with Java - Java Introduction (Basics)

Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 11, No. 1, February 2021, pp. 879~891 ISSN: 2088-8708, DOI: 10.11591/ijece.v11i1.pp879-891  879 Journal homepage: http://ijece.iaescore.com Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks Petcharat Panyapanuwat, Suwatchai Kamonsantiroj, Luepol Pipanmaekaporn Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Thailand Article Info ABSTRACT Article history: Received Jan 1, 2020 Revised Jun 8, 2020 Accepted Aug 18, 2020 Due to its efficiency in storage and search speed, binary hashing has become an attractive approach for a large audio database search. However, most existing hashing-based methods focus on data-independent scheme where random linear projections or some arithmetic expression are used to construct hash functions. Hence, the binary codes do not preserve the similarity and may degrade the search performance. In this paper, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. Different from data-independent hashing methods, we develop a deep network to learn compact binary codes from multiple hierarchical layers of nonlinear and linear transformations such that the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. Experimental results on the Extended Ballroom dataset with 8 genres of 3,000 musical excerpts show that our proposed method significantly outperforms state-of- the-art data-independent method in both effectiveness and efficiency. Keywords: Content-based audio retrieval Deep learning Deep neural networks Similarity-preserving hash Unsupervised learning This is an open access article under the CC BY-SA license. Corresponding Author: Petcharat Panyapanuwat, Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand. Email: panyapetch@hotmail.com 1. INTRODUCTION With rapidly growing database of digital audio recordings, the novel retrieval strategies have received great attention. Early retrieval approach uses textual metadata describing the content of music audio (e.g., artist name, song title, album name, genre, or release year of music). In case such descriptions are not available, it is required content-based retrieval strategy that the perceptual aspects of the audio are utilized. [1]. Content-based audio retrieval approach is generally solved with two steps: first, features are extracted from the audio file and then used to build indexes for searching. Two main issues of performing a search over a large database are search speed and efficient storage. The most interesting approach for handling these problems is binary hashing, where the high-dimensional features are encoded into compact binary codes. There have been several hashing methods proposed in the literature. They can be devided into two categories, data-independent methods and data-dependent methods. Methods in data-independent category [2-7] use random linear projections or some arithmetic expression to construct hash functions. Without the training process, they are robust to data variation. However, such methods require long hash codes to achieve high precision. This increases the storage cost and degrades the search efficiency [8]. 
Methods in data-dependent category, also called learning to hash methods, aim to learn a set of hash functions from available training data that yield compact codes to achieve satisfactory search performance [9]. Existing data-dependent methods can be classified into unsupervised, supervised, and semi-supervised
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 11, No. 1, February 2021 : 879 - 891 880 learning approach. Unsupervised hashing methods [10-12] use unlabeled data to build the hash functions where the neighbor distance (e.g., L2 norm) among the training data is preserved. Supervised or semi- supervised hashing methods [13-17] attemp to improve the quality of hashing by leveraging the semantic labels into the learning process. Compared with data-independent methods, it appears that data-dependent methods can achieve better accuracy with shorter codes [12, 14, 17]. However, data-dependent methods may be too dependent on the training data [18]. There are both advantages and shortcomings of using data-independent and data-dependent methods. However, the previous works of the two categories do not fully take into consideration the similarity preserving and this may degrade the retrieval performance. In this work, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. We develop a deep network with several hierarchical layers of nonlinear and linear transformations to learn compact binary codes where the similarity between samples is preserved. Furthermore, the independence and balance properties are included in the objective function to improve the codes. The proposed method is compared with the Shazam algorithm [3], the data-independent hashing method, in terms of accuracy, precision, recall, false positive rate, and the storage cost. 2. BACKGROUND 2.1. Learning to hash Learning to hash attempts to learn a hash function 𝑦 = ℎ(𝑥) that maps a high-dimensional input item 𝑥 ∈ 𝑅𝐷 to a compact code 𝑦, aiming to improve the search performance [19]. There are 4 topics to consider for the learning to hash: (1) hash function, (2) similarity-preserving, (3) loss function, and (4) deep learning to hash. 2.1.1. Hash function There are several ways to design hash functions. The most widely used hash functions are generalized by linear projection as shown in (1). 𝑦 = ℎ(𝑋) = sgn(𝑓(𝑊𝑇 𝑋 + 𝑏))   where 𝑦 ∈ {0,1} or {−1,1}, 𝑋 = {x𝑛}𝑛=1 𝑁 ∈ 𝑅𝐷x𝑁 is the training set which contains 𝑁 samples, 𝐷 is the dimension of input vector, 𝑊 = {𝑤𝑘}𝑘=1 𝐾 ∈ 𝑅𝐷x𝐾 is the projection vector, 𝐾 is number of hash bits, 𝑏 is the bias variable, sgn(𝑧) = −1 or 0 if 𝑧 < 0 and sgn(𝑧) = 1 otherwise, 𝑓(∙) is a predefined function which can possibly be neural networks or nonlinear function. However, using different 𝑓(∙) yields different hash function properties. 2.1.2. Similarity-preserving The distance 𝑑𝑖𝑗 between two items 𝑥𝑖 and 𝑥𝑗 can be defined by the standardized Euclidean distance ‖𝑥𝑖 − 𝑥𝑗‖ 2 or others. The similarity 𝑠𝑖𝑗 between those items is often defined as a function of the distance 𝑑𝑖𝑗 (e.g., Gaussian function, cosine similarity, and so on). In addition, the semantic similarity approach is generally used in similarity search application. We can apply any distance to the hashing algorithm for semantic similarity, such as Euclidean distance, by defining semantic similarity 𝑠𝑖𝑗 = 1 for adjacent points and 𝑠𝑖𝑗 = 0 or −1 for farther points. In the hash coding space, the Hamming distance 𝑑𝑖𝑗 𝐻 between the code 𝑦𝑖 and 𝑦𝑗 can be defined as ‖𝑦𝑖 − 𝑦𝑗‖ 1 = ∑ ‖ℎ𝑘(𝑥𝑖) − ℎ𝑘(𝑥𝑗)‖ 𝐾 𝑘=1 . It is the number of binary digits where the values are different. Hamming similarity is defined as 𝑠𝑖𝑗 𝐻 = 𝐾 − 𝑑𝑖𝑗 𝐻 for the codes valued by 1 and 0. For the codes valued by 1 and -1, the inner product 𝑠𝑖𝑗 𝐻 = 𝑦𝑖 𝑇 𝑦𝑗 is defined as the similarity. Let’s focus on the term of similarity preserving. 
Let us now focus on similarity preserving itself. In Figure 1(a), there is a set of three points ($x_1$, $x_2$, and $x_3$) in an input space. By measuring the Euclidean distance between the points, we find that $x_1$ is closer to $x_2$ than to $x_3$, i.e., $x_1$ is more similar to $x_2$ than to $x_3$. The codes $h(x_1)$, $h(x_2)$, and $h(x_3)$ are the representations of $x_1$, $x_2$, and $x_3$ in the hash coding space (or Hamming space), respectively. From Figure 1(b), we can see that $h(x_1)$ is closer to $h(x_3)$ while $h(x_2)$ is far away; in this case, the similarities are not preserved. Figure 1(c), on the other hand, shows an example in which the similarities are well preserved.
Figure 1. Similarity-preserving hashing: (a) three points in the input space, (b) hashing where the similarities are not preserved in the two-dimensional Hamming space, and (c) hashing where the similarities are well preserved

2.1.3. Loss function
The loss function is intended to preserve the similarity order, i.e., to minimize the difference between the nearest neighbor search result in the hash coding space and the search result in the input space. The loss function $Loss(X, W)$ is defined as follows:

$Loss(X, W) = \operatorname{argmin} \sum_{x_i, x_j \in X}^{N} \| d_{ij} - d_{ij}^{H} \|^2$ (2)

where $X$ is the input data and $W$ is the projection matrix. Specifically, $y_i = h(x_i)$ needs to be binary. This binary constraint leads to a difficult optimization problem. To solve it, we drop the binary constraint and let the codes be continuous; the codes are then binarized by thresholding. With the binary constraint relaxed, various standard optimization techniques can be applied.
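The relax-then-threshold scheme just described can be sketched as follows. This is a minimal illustration of (2) under the relaxation, where pairwise distances between continuous codes stand in for Hamming distances during optimization; it is not the specific optimizer used later in this paper.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all columns of A."""
    sq = np.sum(A**2, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)

def relaxed_loss(X, Y):
    """Relaxed form of eq. (2): compare pairwise input distances with
    pairwise distances between the continuous (not yet binary) codes Y."""
    return np.sum((pairwise_sq_dists(X) - pairwise_sq_dists(Y)) ** 2)

def binarize(Y):
    """After optimization, threshold the relaxed codes to {-1, 1}."""
    return np.where(Y >= 0.0, 1.0, -1.0)
```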
2.1.4. Deep learning to hash
The goal of learning to hash is to learn specific hash functions that map a high-dimensional input vector to a compact binary vector yielding good retrieval quality and search speed [20]. For unlabeled data, an unsupervised deep learning-to-hash model that maps the input vector $x \in R^D$ to compact binary codes is illustrated in Figure 2.

Figure 2. Unsupervised deep learning model, with weights $w^{(1)}, \ldots, w^{(L)}$ connecting layer 1 (the input layer, $x \in R^D$) through layers 2 to $L$ to layer $L+1$ (the reconstruction layer, $\hat{x} \in R^D$)

Assume that an unsupervised deep network consists of $L+1$ layers. A binary vector $y_i$ is generated by passing the input vector $x_i$ through the network, which contains multiple hierarchical layers of nonlinear functions. The binary code of $x_i$ at the $L$th layer can be calculated as follows:

$y_i = h(x_i) = \mathrm{sgn}(F(x_i, W))$ (3)

where $F(x_i, W)$ is a composition of nonlinear transformations defined as follows:

$F(x_i, W) = f_L(\cdots f_2(f_1(x_i, w^{(1)}), w^{(2)}) \cdots, w^{(L)})$ (4)

where each $f_l(\cdot)$ takes the output of the previous layer and the weight vector $w^{(l)}$ as input and produces the projection passed to the next layer. The learning algorithm aims to learn a set of nonlinear weight vectors $W = \{w^{(1)}, \ldots, w^{(L)}\}$ such that the information from the input space is preserved.

2.2. Search with hashing
There are two strategies for performing a search with hashing: hash code ranking and hash table lookup [19]. For hash code ranking, an exhaustive search is performed by comparing the distance (e.g., the Hamming distance) between the query and the reference items. The items with the smallest distances, called nearest neighbors, are retrieved. However, the cost of computing the distances degrades performance. The alternative approach, hash table lookup, aims to accelerate the search by reducing the number of distance computations. The inverse lookup database, called a hash table, is composed of buckets indexed by the hash codes. Given a query, the matching items stored in the corresponding bucket are retrieved.
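A hash table of this kind reduces to an inverted index from codes to buckets. Below is a minimal sketch of the lookup strategy; the per-item payload (a song identifier) is an illustrative assumption.

```python
from collections import defaultdict

def build_hash_table(codes, items):
    """Inverse lookup database: buckets indexed by hash code, each
    holding the items that hashed to that code."""
    table = defaultdict(list)
    for code, item in zip(codes, items):
        table[code].append(item)
    return table

# usage: a lookup retrieves a whole bucket with no distance computations
table = build_hash_table([885924, 885924, 12345], ["song_1", "song_7", "song_2"])
print(table[885924])   # ['song_1', 'song_7']
```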
2.3. Audio fingerprinting
Audio fingerprinting is best known for its ability to identify an unknown audio recording using a compact content-based signature, the so-called fingerprint [21]. It does this by converting the audio features into hash codes, aiming to uniquely identify an audio recording. A fingerprint is relatively small, so it reduces storage costs. Moreover, perceptual irrelevancies are removed from the fingerprint, resulting in efficient comparison and searching.

3. METHOD
The aim of this paper is to provide an efficient technique that yields good retrieval quality and computational efficiency. In this work, compact binary codes for fingerprint indexing are learned with an unsupervised deep network in such a way that the similarity between samples is preserved. Once a short audio sample is submitted to our content-based audio retrieval system, the system performs a database lookup for the matching track and returns the song ID from which the query was taken. As shown in Figure 3, the system is designed with three steps: (1) fingerprint feature extraction, (2) unsupervised similarity-preserving hashing, and (3) sequence matching.

Figure 3. The construction of the proposed method for our content-based audio retrieval system: fingerprint features $[f_1, \Delta f, \Delta t]$ are extracted from the reference and query audio, mapped to 16-bit hashes by the learned deep network, stored in the items database, and compared by sequence matching to return the song ID
3.1. Fingerprint feature extraction
Before fingerprint feature extraction is performed, the audio signal is converted into a common format for analysis. Next, the time-series audio signal is converted into the time-frequency domain, from which more meaningful information can be extracted. Each aspect is detailed below.

3.1.1. Preprocessing and transform
In this paper, the fingerprint extraction presented in [3] is applied. First, we convert the input audio to a mono signal and downsample it from the standard digital audio rate of 44.1 kHz to 8 kHz, which makes the data easier to handle, reduces the database size, and increases the speed of the algorithm. The audio signal is then converted into a time-frequency representation. We perform a short-time Fourier transform (STFT) with a window size of 64 ms for good spectral resolution [22] and a hop size of 32 ms. Figure 4 shows the resulting time-frequency graph, the so-called spectrogram: time on the horizontal axis, frequency on the vertical axis, and intensity as the third dimension. Each point on the graph represents the intensity of a given frequency at a specific time.

Figure 4. Spectrogram with peak intensities

3.1.2. Feature extraction
After converting the signal into the time-frequency domain, features are extracted from the spectrum. Due to their robustness to noise and distortion, the amplitude peaks in each frame are selected as candidate points. Each candidate point is paired with adjacent peaks. The constellation map of paired points with the coordinate list is shown in Figure 5. In this work, each candidate point is paired within a target region of 31 frequency bins and 63 time frames, and only the 3 peaks closest in time are selected. Figure 6 shows the combinatorial association of a pair of two points, which is called a 'landmark'. Each pair consists of four components: the starting frequency $f_1$, the starting time $t_1$, the end frequency $f_2$, and the end time $t_2$.

Figure 5. A constellation map of paired points

Figure 6. The combinatorial association of a pair of two points: a candidate point $(t_1, f_1)$ is paired with $(t_2, f_2)$ inside a target region of 31 frequency bins and 63 time frames, giving $\Delta f = f_2 - f_1$, $\Delta t = t_2 - t_1$, and the feature $[f_1, \Delta f, \Delta t]$
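A compact sketch of this front end is given below, assuming SciPy's STFT and a simple 3x3 local-maximum peak picker; the paper does not specify its exact peak-picking rule, so that part is an illustrative assumption, while the window, hop, and target-region limits follow the text.

```python
import numpy as np
from scipy.signal import stft

def landmarks(audio, fs=8000):
    """Sketch of the fingerprint front end: 64 ms STFT window, 32 ms hop,
    local-maximum peak picking, then pairing each candidate point with up
    to 3 later peaks inside a 31-bin x 63-frame target region."""
    nper = int(0.064 * fs)                            # 512 samples = 64 ms
    _, _, Z = stft(audio, fs, nperseg=nper, noverlap=nper // 2)
    S = np.abs(Z)
    # candidate points: bins that dominate their 3x3 neighborhood
    peaks = [(t, f)
             for t in range(1, S.shape[1] - 1)
             for f in range(1, S.shape[0] - 1)
             if S[f, t] > 0 and S[f, t] == S[f-1:f+2, t-1:t+2].max()]
    feats = []
    for (t1, f1) in peaks:                            # quadratic pairing, kept simple
        zone = [(t2, f2) for (t2, f2) in peaks
                if 0 < t2 - t1 <= 63 and abs(f2 - f1) <= 31]
        for (t2, f2) in sorted(zone)[:3]:             # 3 closest in time
            feats.append((f1, f2 - f1, t2 - t1, t1))  # [f1, df, dt] plus offset t1
    return feats

feats = landmarks(np.random.randn(8000))              # 1 s of noise at 8 kHz
```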
3.1.3. Audio fingerprint
For the landmark mentioned above, the audio fingerprint can be defined as follows:

$Fingerprint = [f_1, \Delta f, \Delta t]$ (5)

where the frequency difference is $\Delta f = f_2 - f_1$ and the time difference between the two points is $\Delta t = t_2 - t_1$. The fingerprint is also associated with the offset time from the beginning of the audio file to the starting time $t_1$. This fingerprint feature $[f_1, \Delta f, \Delta t]$ is used to generate the hash code in [3]. The hash model can be defined as shown in (6).

$f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$ (6)

where the fingerprint hash is composed of the 8-bit frequency $f_1$, the 6-bit frequency difference $\Delta f$, and the 6-bit time difference $\Delta t$. Figure 7 shows an example of a 20-bit hash address calculated from (6).

Figure 7. An example of a 20-bit hash address: the feature $[f_1, \Delta f, \Delta t] = [216, 18, 36]$ is mapped by the hash function $f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$ to the hash address 885924 (11011000010010100100 in binary)

For a 16-bit fingerprint hash, composed of a 6-bit frequency $f_1$, a 5-bit frequency difference $\Delta f$, and a 5-bit time difference $\Delta t$, the hash model can be defined as shown in (7).

$f_1 \times 2^{10} + \Delta f \times 2^{5} + \Delta t$ (7)

After the hash code is calculated, the system uses this code as an index for searching the database. An exact matching algorithm is applied in [3]. Unlike the Shazam algorithm, we develop a deep neural network with multiple hierarchical layers of nonlinear and linear transformations to learn compact codes from these fingerprint features such that the similarity between samples is preserved. The details are described in the next section.
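The packings in (6) and (7) amount to shifting the three fields into adjacent bit ranges; the following sketch reproduces the Figure 7 example.

```python
def hash20(f1, df, dt):
    """20-bit landmark hash of eq. (6): 8-bit f1 | 6-bit df | 6-bit dt."""
    return f1 * 2**12 + df * 2**6 + dt

def hash16(f1, df, dt):
    """16-bit landmark hash of eq. (7): 6-bit f1 | 5-bit df | 5-bit dt."""
    return f1 * 2**10 + df * 2**5 + dt

print(hash20(216, 18, 36))                  # 885924, the example in Figure 7
print(format(hash20(216, 18, 36), '020b'))  # 11011000010010100100
```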
3.2. Unsupervised similarity-preserving hashing (USH)
In this paper, the hash transformations are created by an unsupervised deep neural network. As shown in Figure 8, there are 5 layers in our deep network: the input layer consists of 20 nodes for the input $x_i$, the three hidden layers consist of 19, 18, and 16 nodes, respectively, and the output layer consists of 20 nodes for the reconstruction $\hat{x}_i$.

Figure 8. Our proposed unsupervised similarity-preserving hashing network (USH), with 20-19-18-16-20 nodes from the input layer (layer 1) to the reconstruction layer (layer 5)

Our deep network is learned so that the output of the fourth layer can be used as the binary hash codes. In the network design, each node is composed of one input summation function and one output transformation function. The function $f(\cdot)$ combines the information arriving over the links from other nodes, as shown in (8).

$node_{in} = f(x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}; w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n})$ (8)

where $x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}$ are the inputs to the node, $w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n}$ are the associated weights, $l$ indicates the layer number, and $n$ is the number of input nodes. The output activation function $a(f(\cdot))$ is shown in (9).

$node_{out} = a_{(l)}(node_{in}) = a_{(l)} f(\cdot)$ (9)

Let $H_i = \sum w_i x_i$ be the sum of products of the inputs $x_i$ and weights $w_i$. The functions of the nodes from layer 1 to layer 5 of our proposed network are defined as follows.

Layer 1: In this layer, the nodes only convey the inputs to the nodes of the next layer. The functions of the $i$th node are

$f_{(1),i} = x_{(1),i}$ and $a_{(1),i} = f_{(1),i}$ (10)

Layer 2: In this layer, the sigmoid function is used as the activation function. Let $x_{(2),i} = a_{(1),i}$ and $H_{(2),i} = \sum w_{(2),i} x_{(2),i}$. The functions of the $i$th node are

$f_{(2),i} = \frac{1}{1 + e^{-H_{(2),i}}}$ and $a_{(2),i} = f_{(2),i}$ (11)

Layer 3: In this layer, the sigmoid function is again applied as the activation function. Let $x_{(3),i} = a_{(2),i}$ and $H_{(3),i} = \sum w_{(3),i} x_{(3),i}$. The functions of the $i$th node are

$f_{(3),i} = \frac{1}{1 + e^{-H_{(3),i}}}$ and $a_{(3),i} = f_{(3),i}$ (12)

Layer 4: The output of each node in this layer will be used as the binary codes. During training, these codes are used to reconstruct the input data at the output layer. The hyperbolic tangent function is used as the activation function in this layer. Let $x_{(4),i} = a_{(3),i}$ and $H_{(4),i} = \sum w_{(4),i} x_{(4),i}$. The functions of the $i$th node are

$f_{(4),i} = \frac{e^{H_{(4),i}} - e^{-H_{(4),i}}}{e^{H_{(4),i}} + e^{-H_{(4),i}}}$ and $a_{(4),i} = f_{(4),i}$ (13)

Layer 5: This is the output or reconstruction layer. To preserve the similarity between samples, the target outputs are set equal to the inputs of layer 1. Let $x_{(5),i} = a_{(4),i}$ and $H_{(5),i} = \sum w_{(5),i} x_{(5),i}$. The functions of the $i$th output node are

$f_{(5),i} = H_{(5),i}$ and $a_{(5),i} = f_{(5),i}$ (14)
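The layer definitions (10)-(14) correspond to the following NumPy forward pass; the weight initialization is an illustrative assumption, and the training objective appears next in (15).

```python
import numpy as np

def sigmoid(H):
    return 1.0 / (1.0 + np.exp(-H))

def ush_forward(X, W, b):
    """Forward pass of the 20-19-18-16-20 USH network: sigmoid hidden
    layers (11)-(12), tanh code layer (13), linear reconstruction (14)."""
    H2 = sigmoid(W[0] @ X + b[0])      # layer 2, 19 nodes
    H3 = sigmoid(W[1] @ H2 + b[1])     # layer 3, 18 nodes
    H4 = np.tanh(W[2] @ H3 + b[2])     # layer 4, 16 nodes -> hash codes
    X_hat = W[3] @ H4 + b[3]           # layer 5, 20 nodes, reconstructs X
    return H4, X_hat

sizes = [20, 19, 18, 16, 20]
rng = np.random.default_rng(0)
W = [0.1 * rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((o, 1)) for o in sizes[1:]]

X = rng.standard_normal((20, 5))       # five 20-dimensional fingerprints
H4, X_hat = ush_forward(X, W, b)
Y = np.sign(H4)                        # 16-bit binary codes in {-1, 1}
```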
To achieve efficient binary codes, we include constraints in the objective function so that the codes have four properties: (1) belonging to $\{1, -1\}$, (2) similarity preserving, (3) independence, and (4) balance. In this paper, the method presented in UH-BDNN [23] is applied to optimize the objective function, which is defined as follows:

$\min_{W,b} Loss = \frac{1}{2N} \left\| X - \left(W^{(L-1)} Y + b^{(L-1)} \times [1]_{1 \times N}\right) \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{L-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2N} \left\| H^{(L-1)} - Y \right\|^2 + \frac{\lambda_3}{2} \left\| \frac{1}{N} H^{(L-1)} H^{(L-1)T} - I \right\|^2 + \frac{\lambda_4}{2N} \left\| H^{(L-1)} [1]_{N \times 1} \right\|^2$ (15)

$s.t.\ Y \in \{1, -1\}^{K \times N}$ (16)

where $X \in R^{D \times N}$ is a set of $N$ training samples of dimension $D$, $Y \in \{1, -1\}^{K \times N}$ is the output binary code of $X$, $K$ is the number of bits, $L$ is the number of layers, $W^{(l)}$ is the weight matrix between layer $l+1$ and layer $l$, $b^{(l)}$ is the bias vector for the nodes in layer $l+1$, $H^{(l)} = f^{(l)}(W^{(l-1)} H^{(l-1)} + b^{(l-1)} [1]_{1 \times N})$ is the output of layer $l$ with $H^{(1)} = X$, $f^{(l)}$ is the activation function of layer $l$, and $\lambda_1$-$\lambda_4$ are the parameters for optimizing the objective function. The first term of (15) ensures that the binary codes allow a good reconstruction of $X$. The second term is a weight regularizer that encourages the network to keep the weights small in order to reduce overfitting. The third term measures the violation of the equality constraint. The fourth term enforces independence and the fifth term balance of the binary codes. The constraint (16) ensures that each bit of the binary codes belongs to $\{1, -1\}$.

After the efficient codes are produced by the deep network, they are used as the search index in our content-based audio retrieval system. The song ID, $t_1$, $f_1$, $\Delta f$, and $\Delta t$ are stored at their hash address in the database. Table 1 shows the representation of the information data.

Table 1. Representation of information data
  Method      | Index                        | Information data
  USH         | 16-bit hash address          | Song ID, $t_1$, $f_1$, $\Delta f$, $\Delta t$
  Shazam [3]  | 16-bit / 20-bit hash address | Song ID, $t_1$

3.3. Sequence matching
In the query step, a sequence of query features is converted to a set of compact hash codes and used for searching the inverse lookup database. Let $Q = \{q_1, q_2, \ldots, q_M\}$ represent the sequence of query features, where $q_m$ is the query at order $m$, $m = 1, 2, \ldots, M$, and $M$ is the total number of queries in the sequence. The learned hash function $H: q_m \rightarrow h(q_m)$ is used to map the query features to binary hash codes. We can map $Q = \{q_m\}_{m=1}^{M}$ to the corresponding binary codes as follows:

$Y = H(Q) = \{h(q_1), h(q_2), \ldots, h(q_M)\}$ (17)

where $Y \in \{1, -1\}^{K \times M}$ is the set of hash codes of $Q$ and $K$ is the number of bits. After learning the deep network, we obtain a set of items indexed by each hash address. Let $s_m = \{x_{m1}, x_{m2}, \ldots, x_{mn_m}\}$ be the set of items retrieved for $q_m$, where $x_{mi} \in R^{5 \times 1}$ is an information vector stored in the database and $n_m$ is the number of items for $q_m$. Given the set $S = \{s_m\}_{m=1}^{M}$, the sequence matching process is shown in Figure 9.

Figure 9. Our proposed sequence matching: the query sequence $Q = \{q_1, q_2, q_3\}$ is mapped by the learned hash function to addresses in the reference inverse lookup database, the retrieved buckets are collected into $S$, and the song ID is returned from the most frequently occurring similar relative offset time among the minimum-distance candidates
As can be seen in Figure 9, assume that $Q = \{q_1, q_2, q_3\}$; given the binary codes $H(Q)$, we obtain the sets $s_1 = \{x_{11}, x_{12}, x_{13}, x_{14}\}$, $s_2 = \{x_{21}, x_{22}\}$, $s_3 = \{x_{31}, x_{32}, x_{33}\}$, and $S = \{x_{11}, x_{12}, x_{13}, x_{14}, x_{21}, x_{22}, x_{31}, x_{32}, x_{33}\}$. The one-nearest-neighbor search $NN(q_m)$ for the query item at order $m$ within $s_m$ is defined as follows:

$NN(q_m) = \operatorname{argmin}_{x_{mi} \in s_m} \| x_{mi} - q_m \|_2$ (18)

where $\| x_{mi} - q_m \|_2$ is the $L_2$-norm between $x_{mi}$ and the query $q_m$. Let $R_Q = \{NN(q_m)\}_{m=1}^{M}$ be the candidate item set of $Q$. We also apply a time offset constraint to improve the accuracy of the sequence matching. The time offset constraint $|T_{x_m} - T_{q_m}|$ is the absolute difference between $T_{x_m}$ and $T_{q_m}$, defined as follows:

$|T_{x_1} - T_{q_1}| = \cdots = |T_{x_M} - T_{q_M}|$ (19)

where $T_{x_m}$ and $T_{q_m}$ are the offset times of the reference file $x_m$ and the query $q_m$, respectively. The offset time constraint expresses that a sequence of candidate items from the same recording should occur with the same absolute time difference across the query sequence. In summary, the proposed audio retrieval algorithm is based on two parts: the similarity (minimum distance) between the audio query and the songs in the reference database, and the absolute difference among the time sequences. The procedure is as follows.

Algorithm: Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks
Input: reference set $M = \{m_i\}_{i=1}^{N}$; query set $Q$
Output: SID, the song ID returned by the proposed algorithm
Step 1: Extract fingerprint features $X = \{x_i\}_{i=1}^{N} \in R^{D \times N}$ from the reference dataset.
Step 2: Learn the hash, where $x_i \in R^D$ is an input vector and the objective function is defined by (15)-(16). The learned hash function $h(q)$ is applied to each audio fingerprint $q$ in step 3.
Step 3: Sequence matching (see the sketch after this algorithm)
  1. Divide the query sample into $M$ fingerprints $Q = \{q_m\}_{m=1}^{M}$.
  2. $s_i = \{\}$, $A = \{\}$, $S = \{\}$, $R_Q = \{\}$
     for $i = 1, 2, \ldots, M$ do
       $index_i = h(q_i)$
       $s_i$ = all items collected from bucket $A(index_i)$
       $S = S \cup s_i$
     end
     for $i = 1, 2, \ldots, M$ do
       Aux = maxValue
       for $j = 1, 2, \ldots$, sizeOf($s_i$) do
         if $\|x_j - q_i\|_2$ < Aux then
           Aux = $\|x_j - q_i\|_2$
           absTime = $|T_{x_j} - T_{q_i}|$
           SID = $SID_{x_j}$
         end
       end
       $R_Q = R_Q \cup \{SID, absTime\}$
     end
  3. Return the song ID with the maximum frequency in $R_Q$ among entries having the same absolute difference among the time sequences.
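Step 3 condenses into a few lines of Python. This is a minimal sketch assuming a lookup table mapping hash codes to (song ID, offset time, feature vector) entries; the names are chosen for illustration.

```python
from collections import Counter
import numpy as np

def sequence_match(queries, h, table):
    """For each query fingerprint, pick the nearest bucket item by L2
    distance (eq. (18)), then vote per (song ID, |T_x - T_q|) (eq. (19))."""
    votes = Counter()
    for q, t_q in queries:                      # (feature vector, offset time)
        bucket = table.get(h(q), [])            # items stored at this hash
        if not bucket:
            continue
        sid, t_x, _ = min(bucket, key=lambda it: np.linalg.norm(it[2] - q))
        votes[(sid, abs(t_x - t_q))] += 1
    if not votes:
        return None
    (sid, _), _ = votes.most_common(1)[0]       # most frequent consistent hit
    return sid
```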
4. EXPERIMENTAL AND PERFORMANCE ANALYSIS
4.1. Database
The performance of the proposed USH method is evaluated on the Extended Ballroom dataset, freely available from [24, 25]. The dataset consists of 4,180 musical excerpts from 13 genres with a length of 30 seconds each. The audio quality of this data is 44.1 kHz, 192 kbps, stereo, MP3 format. In this work, the audio signal is downsampled to 8 kHz to make the data easier to handle, as previously mentioned. The training set (also used as the reference database for retrieval) is composed of 3,000 tracks from 8 genres, the same rhythm classes as in our previous works [26, 27]. A set of 1,000 audio queries with a length of 10 seconds each is randomly selected from those 3,000 tracks. Another set of 200 audio queries comes from audio files that do not appear in the database, in order to analyze the false positive rate. Each audio sample is represented by a 20-dimensional feature vector extracted by the fingerprint algorithm. Table 2 shows the number of samples in the database and the query set.

Table 2. Audio samples in the database and query set
  Set      | Number of tracks | Length of segment (s) | Number of samples
  Database | 3,000            | 30                    | 441,184
  Query    | 1,200            | 10                    | 67,785

4.2. Performance evaluation
4.2.1. Effectiveness of retrieval
On a total of 1,200 audio queries, the retrieval results obtained from the proposed USH method and the state-of-the-art data-independent method, the Shazam algorithm, are shown in Table 3. A false negative (FN) refers to the incorrect identification that the query audio does not exist in the database when it does, a true positive (TP) refers to the correct identification of the audio recording from the query, a false positive (FP) refers to the incorrect identification of a wrong recording when the correct recording does not exist in the database, and a true negative (TN) refers to the correct identification that no audio recording matches the query. According to the experimental results, the proposed USH obtains a higher accuracy (88.92%) than the state-of-the-art data-independent method, the Shazam algorithm (71.67% for the 16-bit hash code, 87.42% for the 20-bit hash code). Figure 10 shows the retrieval accuracy comparison between the two methods.

Table 3. Retrieval results comparison between USH and the state-of-the-art data-independent method
  Method        | FN  | TP  | FP | TN
  USH 16-bit    | 114 | 886 | 19 | 181
  Shazam 16-bit | 290 | 710 | 50 | 150
  Shazam 20-bit | 132 | 868 | 19 | 181

Figure 10. The retrieval accuracy of the proposed USH and the Shazam algorithm

The effectiveness of the USH is evaluated through the experiments and compared with the state-of-the-art data-independent method in terms of precision, recall, F1 score, and false positive rate [28], as follows:

$Precision = \frac{TP}{TP + FP} \times 100$ (20)

$Recall = \frac{TP}{TP + FN} \times 100$ (21)

$F1\ score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$ (22)

$False\ positive\ rate = \frac{FP}{FP + TN} \times 100$ (23)
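The following sketch computes (20)-(23) from a row of Table 3 and reproduces the corresponding row of Table 4.

```python
def metrics(tp, fn, fp, tn):
    """Precision, recall, F1 score, and false positive rate, eqs. (20)-(23)."""
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = 100 * fp / (fp + tn)
    return precision, recall, f1, fpr

# USH 16-bit row of Table 3 reproduces Table 4: (97.9, 88.6, 93.02, 9.5)
print(tuple(round(v, 2) for v in metrics(tp=886, fn=114, fp=19, tn=181)))
```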
As can be seen in Table 4, the proposed USH obtains higher precision and recall values than the state-of-the-art data-independent method with both 16-bit and 20-bit hash codes. The F1 score (fourth column) shows the overall effectiveness of the two methods. Furthermore, the USH has a significantly lower percentage of false positives (9.50%) than the state-of-the-art data-independent method (25.00%) for the same 16-bit hash code. This shows the superior performance of the USH on short codes.

Table 4. Effectiveness comparison between USH and the state-of-the-art data-independent method
  Method        | Precision | Recall | F1 score | % False positive
  USH 16-bit    | 97.90     | 88.60  | 93.02    | 9.50
  Shazam 16-bit | 93.42     | 71.00  | 80.68    | 25.00
  Shazam 20-bit | 97.86     | 86.80  | 92.00    | 9.50

4.2.2. Storage cost
With the proposed USH method, the 20-bit fingerprint features are mapped to 16-bit binary codes. This reduces the hash address space, and hence the index size, by 16-fold (from 2^20 to 2^16 buckets), resulting in higher search performance.

4.3. Discussion
The performance of the proposed USH for large audio database retrieval is evaluated through the experiments and compared with the state-of-the-art data-independent hashing method, the Shazam algorithm, on a test set of 3,000 audio recordings. The experimental results support the effectiveness of the USH, with high precision and recall values of 97.90% and 88.60%, respectively. These satisfactory results follow from the fact that the hash codes produced by the proposed method have the similarity-preserving property, i.e., similar items are mapped to the same hash code while dissimilar items are mapped to different ones. Data-independent methods do not take this property into account. The Shazam algorithm has a higher percentage of false positives (25.00%) than the USH (9.50%) for the same 16-bit hash code. This indicates that the Shazam algorithm is more likely to give incorrect identifications for short codes and therefore performs worse in audio retrieval. Furthermore, if the database grows substantially in the future, the Shazam algorithm would most likely produce a significant number of false positive matches.

Consider the collection S of the USH and of the Shazam algorithm, which affects the accuracy of audio retrieval. For the Shazam algorithm, the collection S consists of the data items for which the search algorithm finds only exact matches for the search queries, regardless of similarity preservation, and this may lose a number of relevant data. The collection S of the proposed USH method, in contrast, consists of candidate data items for which the search algorithm focuses on the similarity between the search queries and the items in the database. As shown in Table 5, with the collection S of the USH, Song ID 48 is correctly identified by the two data items with the smallest distance (distance = 1) at the same time offset. With the Shazam algorithm, the relevant song cannot be retrieved.

Table 5. Example of the collection S of the proposed USH and the Shazam algorithm
  Method        | Song ID | Time offset | Distance | Number of item(s)
  USH 16-bit    | 150     | 128         | 16       | 2
                | 1041    | 239         | 28       | 2
                | 936     | -232        | 32       | 2
                | 48*     | 152         | 1        | 2
                | 306     | 228         | 48       | 2
                | 2453    | 516         | 24       | 2
                | 690     | 3720        | 32       | 1
                | 690     | 2726        | 32       | 1
  Shazam 16-bit | 2732    | 164         | -        | 2
                | 847     | 19          | -        | 2
                | 684     | 2871        | -        | 1
                | 674     | 4475        | -        | 1
                | 675     | 2584        | -        | 1
                | 676     | 1564        | -        | 1
                | 676     | 4739        | -        | 1
                | 677     | 1820        | -        | 1
  Shazam 20-bit | 1174    | 52          | -        | 1
                | 1176    | 557         | -        | 1
                | 1230    | 218         | -        | 1
                | 706     | 363         | -        | 1
                | 340     | 395         | -        | 1
                | 48      | 153         | -        | 1
                | 93      | 448         | -        | 1
                | 139     | 453         | -        | 1
  * The system correctly identifies the audio recording
The major factors behind this superiority are 1) the similarity-preserving hash codes produced by the proposed USH method, and 2) the proposed audio retrieval algorithm, which combines two metrics: the $L_2$-norm and the absolute offset time difference. These factors increase the ability to identify candidate items according to the similarity between the audio sample and the songs in the reference database, which significantly improves the retrieval performance.

5. CONCLUSION
In this paper, an unsupervised similarity-preserving hashing (USH) method for content-based audio retrieval is proposed. We develop a deep network with multiple hierarchical layers of nonlinear and linear transformations to learn compact hash codes where the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. The experimental results on the Extended Ballroom dataset show the superiority of the proposed method over the state-of-the-art data-independent method. Future work should focus on extending USH to supervised hashing by leveraging semantic labels to further enhance the retrieval performance.

REFERENCES
[1] P. Grosche, M. Müller, and J. Serrà, "Audio Content-Based Music Retrieval," Multimodal Music Processing, Dagstuhl Follow-Ups, vol. 3, pp. 157-174, 2012.
[2] J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system with an efficient search strategy," Journal of New Music Research, vol. 32, no. 2, pp. 211-222, 2003.
[3] A. L. Wang, "An Industrial-Strength Audio Search Algorithm," 4th International Conference on Music Information Retrieval (ISMIR 2003), pp. 7-13, 2003.
[4] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, "Time-Frequency Ratio Hashing for Content-Based Audio Retrieval," 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, pp. 205-210, 2017.
[5] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," 25th International Conference on Very Large Data Bases (VLDB'99), pp. 518-529, 1999.
[6] B. Kulis, P. Jain, and K. Grauman, "Fast Similarity Search for Learned Metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.
[7] M. Raginsky and S. Lazebnik, "Locality-Sensitive Binary Codes from Shift-Invariant Kernels," 23rd Annual Conference on Neural Information Processing Systems (NIPS'09), pp. 1509-1517, 2009.
[8] Y. Zheng, J. Zhu, W. Fang, and L.-H. Chi, "Deep Learning Hash for Wireless Multimedia Image Content Security," Security and Communication Networks, vol. 2018, pp. 1-13, 2018.
[9] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, "Learning to Hash for Indexing Big Data-A Survey," Proceedings of the IEEE, vol. 104, no. 1, pp. 34-57, 2016.
[10] Y. Weiss, A. Torralba, and R. Fergus, "Spectral Hashing," 21st International Conference on Neural Information Processing Systems (NIPS'08), pp. 1753-1760, 2008.
[11] B. Kulis and K. Grauman, "Kernelized Locality-Sensitive Hashing for Scalable Image Search," 2009 IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, pp. 2130-2137, 2009.
[12] Y. Gong and S. Lazebnik, "Iterative Quantization: A Procrustean Approach to Learning Binary Codes," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 817-824, 2011.
[13] B. Kulis and T. Darrell, "Learning to Hash with Binary Reconstructive Embeddings," 22nd International Conference on Neural Information Processing Systems (NIPS'09), pp. 1042-1050, 2009.
[14] W. Liu, J. Wang, R. Ji, Y. G. Jiang, and S. F. Chang, "Supervised Hashing with Kernels," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, pp. 2074-2081, 2012.
[15] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, "Hamming Distance Metric Learning," 25th International Conference on Neural Information Processing Systems (NIPS'12), pp. 1061-1069, 2012.
[16] J. Wang, S. Kumar, and S. F. Chang, "Semi-Supervised Hashing for Large-Scale Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393-2406, 2012.
[17] F. Shen, C. Shen, W. Liu, and H. T. Shen, "Supervised Discrete Hashing," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 37-45, 2015.
[18] X. Bai, H. Yang, J. Zhou, P. Ren, and J. Cheng, "Data-dependent Hashing Based on p-Stable Distribution," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5033-5046, 2014.
[19] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, "A Survey on Learning to Hash," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 769-790, 2018.
[20] J. He, S. F. Chang, R. Radhakrishnan, and C. Bauer, "Compact hashing with joint optimization of search accuracy and time," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 753-760, 2011.
[21] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, "A Review of Audio Fingerprinting," Journal of VLSI Signal Processing, vol. 41, pp. 271-284, 2005.
[22] D. Ellis, "Robust Landmark-Based Audio Fingerprinting," 2015. [Online]. Available: https://labrosa.ee.columbia.edu/matlab/fingerprint/.
[23] T. T. Do, A. D. Doan, and N. M. Cheung, "Learning to Hash with Binary Deep Neural Network," 14th European Conference on Computer Vision (ECCV 2016), pp. 219-234, 2016.
[24] U. Marchand and G. Peeters, "Scale and shift invariant time/frequency representation using auditory statistics: Application to rhythm description," 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Vietri sul Mare, pp. 1-6, 2016.
[25] U. Marchand and G. Peeters, "The Extended Ballroom Dataset," 17th International Society for Music Information Retrieval Conference (ISMIR 2016) Late-Breaking Session, New York, USA, pp. 1-3, 2016.
[26] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, "Unsupervised Learning Hash for Content-Based Audio Retrieval Using Deep Neural Networks," 2019 11th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, pp. 99-104, 2019.
[27] P. Panyapanuwat and S. Kamonsantiroj, "Performance Comparison of Unsupervised Deep Hashing with Data-independent Hashing for Content-Based Audio Retrieval," 2019 2nd International Conference on Electronics, Communications and Control Engineering, pp. 16-20, 2019.
[28] C. Manning, P. Raghavan, and H. Schütze, "An Introduction to Information Retrieval," Cambridge University Press, 2009.

BIOGRAPHIES OF AUTHORS
Petcharat Panyapanuwat holds a bachelor's degree in mathematics and a master's degree in software engineering. She is currently a Ph.D. candidate at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. Her current research interest focuses on music information retrieval.

Suwatchai Kamonsantiroj is currently a lecturer at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. He holds a bachelor's degree in mechanical engineering and a master's degree in information technology management. He earned his doctoral degree in computer engineering from Kasetsart University, Thailand, graduating in 2008. His current research interests include neural networks, time series analysis, and artificial intelligence.

Luepol Pipanmaekaporn is currently a lecturer at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. He holds both a bachelor's and a master's degree in computer science. He earned his doctoral degree in computer science from Queensland University of Technology, Australia, graduating in 2013. His current research interests include information retrieval, web mining, and data mining.