International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 1, February 2021, pp. 879~891
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i1.pp879-891
Journal homepage: http://ijece.iaescore.com
Similarity-preserving hash for content-based audio retrieval
using unsupervised deep neural networks
Petcharat Panyapanuwat, Suwatchai Kamonsantiroj, Luepol Pipanmaekaporn
Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Thailand
Article Info

Article history:
Received Jan 1, 2020
Revised Jun 8, 2020
Accepted Aug 18, 2020

Keywords:
Content-based audio retrieval
Deep learning
Deep neural networks
Similarity-preserving hash
Unsupervised learning

ABSTRACT

Due to its efficiency in storage and search speed, binary hashing has become an attractive approach for large audio database search. However, most existing hashing-based methods follow a data-independent scheme in which random linear projections or arithmetic expressions are used to construct the hash functions. Hence, the binary codes do not preserve the similarity and may degrade the search performance. In this paper, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. Unlike data-independent hashing methods, we develop a deep network to learn compact binary codes from multiple hierarchical layers of nonlinear and linear transformations such that the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. Experimental results on the Extended Ballroom dataset with 8 genres of 3,000 musical excerpts show that our proposed method significantly outperforms a state-of-the-art data-independent method in both effectiveness and efficiency.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Petcharat Panyapanuwat,
Department of Computer and Information Science,
King Mongkut’s University of Technology North Bangkok,
Bangkok, Thailand.
Email: panyapetch@hotmail.com
1. INTRODUCTION
With the rapidly growing databases of digital audio recordings, novel retrieval strategies have received great attention. Early retrieval approaches use textual metadata describing the content of the music audio (e.g., artist name, song title, album name, genre, or release year). In cases where such descriptions are not available, a content-based retrieval strategy that utilizes the perceptual aspects of the audio is required [1].
Content-based audio retrieval is generally performed in two steps: first, features are extracted from the audio file, and then these features are used to build indexes for searching. The two main issues when performing a search over a large database are search speed and efficient storage. A promising approach for handling these problems is binary hashing, in which high-dimensional features are encoded into compact binary codes.
Several hashing methods have been proposed in the literature. They can be divided into two categories: data-independent methods and data-dependent methods. Methods in the data-independent category [2-7] use random linear projections or arithmetic expressions to construct the hash functions. Without a training process, they are robust to data variation. However, such methods require long hash codes to achieve high precision, which increases the storage cost and degrades the search efficiency [8].
Methods in the data-dependent category, also called learning to hash methods, aim to learn a set of hash functions from available training data that yield compact codes and achieve satisfactory search performance [9]. Existing data-dependent methods can be classified into unsupervised, supervised, and semi-supervised learning approaches. Unsupervised hashing methods [10-12] use unlabeled data to build the hash functions such that the neighbor distance (e.g., the L2 norm) among the training data is preserved. Supervised or semi-supervised hashing methods [13-17] attempt to improve the quality of hashing by leveraging semantic
labels into the learning process. Compared with data-independent methods, it appears that data-dependent
methods can achieve better accuracy with shorter codes [12, 14, 17]. However, data-dependent methods may
be too dependent on the training data [18].
There are both advantages and shortcomings to data-independent and data-dependent methods. However, previous works in both categories do not fully take similarity preservation into consideration, and this may degrade the retrieval performance. In this work, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. We develop a deep network with several hierarchical layers of nonlinear and linear transformations to learn compact binary codes in which the similarity between samples is preserved. Furthermore, the independence and balance properties are included in the objective function to improve the codes. The proposed method is compared with the Shazam algorithm [3], a data-independent hashing method, in terms of accuracy, precision, recall, false positive rate, and storage cost.
2. BACKGROUND
2.1. Learning to hash
Learning to hash attempts to learn a hash function $y = h(x)$ that maps a high-dimensional input item $x \in \mathbb{R}^D$ to a compact code $y$, aiming to improve the search performance [19]. There are four topics to consider in learning to hash: (1) the hash function, (2) similarity preserving, (3) the loss function, and (4) deep learning to hash.
2.1.1. Hash function
There are several ways to design hash functions. The most widely used hash functions are generalized by linear projection, as shown in (1):

$y = h(X) = \mathrm{sgn}(f(W^T X + b))$  (1)

where $y \in \{0,1\}$ or $\{-1,1\}$, $X = \{x_n\}_{n=1}^{N} \in \mathbb{R}^{D \times N}$ is the training set containing $N$ samples, $D$ is the dimension of the input vector, $W = \{w_k\}_{k=1}^{K} \in \mathbb{R}^{D \times K}$ is the projection matrix, $K$ is the number of hash bits, $b$ is the bias variable, $\mathrm{sgn}(z) = -1$ (or $0$) if $z < 0$ and $\mathrm{sgn}(z) = 1$ otherwise, and $f(\cdot)$ is a predefined function, possibly a neural network or another nonlinear function. Different choices of $f(\cdot)$ yield different hash function properties.
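As a concrete illustration, the following is a minimal NumPy sketch of such a linear-projection hash in (1); the random Gaussian projections, the zero bias, and the identity choice of $f(\cdot)$ are assumptions for illustration, in the spirit of data-independent methods, not the learned functions proposed later in this paper.

import numpy as np

def linear_projection_hash(X, W, b):
    # y = sgn(f(W^T X + b)) with f = identity; X: (D, N), W: (D, K), b: (K, 1)
    Z = W.T @ X + b                      # (K, N) projections
    return np.where(Z < 0, -1, 1)        # sgn, mapping each bit to {-1, +1}

rng = np.random.default_rng(0)
D, N, K = 20, 5, 16
X = rng.normal(size=(D, N))              # toy training set of N samples
W = rng.normal(size=(D, K))              # assumed random Gaussian projections
b = np.zeros((K, 1))
codes = linear_projection_hash(X, W, b)  # one 16-bit code per sample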
2.1.2. Similarity-preserving
The distance $d_{ij}$ between two items $x_i$ and $x_j$ can be defined by the standardized Euclidean distance $\|x_i - x_j\|_2$ or by other metrics. The similarity $s_{ij}$ between those items is often defined as a function of the distance $d_{ij}$ (e.g., a Gaussian function, cosine similarity, and so on). In addition, the semantic similarity approach is generally used in similarity search applications. We can apply any distance, such as the Euclidean distance, to the hashing algorithm for semantic similarity by defining the semantic similarity as $s_{ij} = 1$ for adjacent points and $s_{ij} = 0$ or $-1$ for farther points.

In the hash coding space, the Hamming distance $d_{ij}^H$ between the codes $y_i$ and $y_j$ can be defined as $\|y_i - y_j\|_1 = \sum_{k=1}^{K} |h_k(x_i) - h_k(x_j)|$. It is the number of binary digits whose values differ. The Hamming similarity is defined as $s_{ij}^H = K - d_{ij}^H$ for codes valued 1 and 0. For codes valued 1 and -1, the inner product $s_{ij}^H = y_i^T y_j$ is used as the similarity.
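To make these definitions concrete, a small sketch computes the Hamming distance and both similarity variants for illustrative 4-bit codes:

import numpy as np

K = 4
y_i = np.array([1, -1, 1, 1])    # code valued in {1, -1}
y_j = np.array([1, 1, -1, 1])

d_H = int(np.sum(y_i != y_j))    # Hamming distance: number of differing bits -> 2
s_H = K - d_H                    # Hamming similarity K - d_H -> 2
s_inner = int(y_i @ y_j)         # inner-product similarity for {1, -1} codes -> 0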
Let us now focus on similarity preserving. In Figure 1(a), there is a set of three points ($x_1$, $x_2$, and $x_3$) in an input space. By measuring the Euclidean distances between the points, we find that $x_1$ is closer to $x_2$ than to $x_3$, i.e., $x_1$ is more similar to $x_2$ than to $x_3$. $h(x_1)$, $h(x_2)$, and $h(x_3)$ are the representations of $x_1$, $x_2$, and $x_3$ in the hash coding space (or Hamming space), respectively. In Figure 1(b), $h(x_1)$ is closer to $h(x_3)$ while $h(x_2)$ is far away; in this case, the similarities are not preserved. Figure 1(c), on the other hand, shows an example in which the similarities are well preserved.
Figure 1. Similarity-preserving hashing: (a) similar and dissimilar points x1, x2, x3 in the input space; (b) codes h(x1), h(x2), h(x3) in a 2-dim Hamming space where similarities are not preserved; (c) codes where similarities are well preserved
2.1.3. Loss function
The loss function is intended to preserve the similarity order, i.e., to minimize the difference between the nearest neighbor search result in the hash coding space and that in the input space. The loss function $Loss(X, W)$ is defined as follows:

$Loss(X, W) = \arg\min \sum_{x_i, x_j \in X}^{N} \| d_{ij} - d_{ij}^H \|^2$  (2)
where 𝑋 is the input data, and 𝑊 is the projection vector.
Specifically, 𝑦𝑖 = ℎ(𝑥𝑖) needs to be binary. This binary constraint leads to a difficult optimization
problem. To solve the problem, we drop the binary constraint and let the codes be continuous. The codes are
then binarized with thresholding. For binary constraint relaxation, various standard optimization techniques
can be applied.
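A minimal sketch of this relax-then-binarize step, assuming codes in {1, -1} and a zero threshold suitable for tanh-like continuous outputs:

import numpy as np

def binarize(H, threshold=0.0):
    # Threshold relaxed continuous codes H into binary codes in {1, -1}
    return np.where(H >= threshold, 1, -1)

H_relaxed = np.array([[0.93, -0.41],
                      [-0.07, 0.88]])    # continuous outputs after optimization
Y = binarize(H_relaxed)                  # [[ 1, -1], [-1,  1]]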
2.1.4. Deep learning to hash
The goal of learning to hash is to learn specific hash functions that map a high-dimensional input vector to a compact binary vector yielding good retrieval quality and search speed [20]. For unlabeled data, an illustration of an unsupervised deep learning to hash model that maps the input vector $x \in \mathbb{R}^D$ to compact binary codes is shown in Figure 2.
Figure 2. Unsupervised deep learning model: an input layer x ∈ R^D (layer 1), hidden layers 2 to L with weights w^(1), ..., w^(L), and a reconstruction layer x̂ ∈ R^D (layer L+1)
Assume that an unsupervised deep network consists of $L+1$ layers. A binary vector $y_i$ is generated by passing the input vector $x_i$ through the network, which contains multiple hierarchical layers of nonlinear functions. The binary code of $x_i$ at the $L$th layer can be calculated as follows:

$y_i = h(x_i) = \mathrm{sgn}(F(x_i, W))$  (3)

where $F(x_i, W)$ is a composition of nonlinear transformations defined as follows:

$F(x_i, W) = f_L(\cdots f_2(f_1(x_i, w^{(1)}), w^{(2)}) \cdots, w^{(L)})$  (4)

where the vector $x_i$ and the weight vector $w^{(l)}$ are used as input, and the projection $x_{i+1}$ is produced by $f_i(\cdot)$. The learning algorithm aims to learn a set of nonlinear weight vectors $W = \{w^{(1)}, \ldots, w^{(L)}\}$ such that the information from the input space is preserved.
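A sketch of the composition $F(x, W)$ in (4) followed by the sign binarization in (3); the tanh nonlinearity here is an assumption for illustration, since the paper specifies its own activations in section 3.2.

import numpy as np

def F(x, weights, biases):
    # F(x, W) = f_L(... f_2(f_1(x, w^(1)), w^(2)) ..., w^(L)), eq. (4)
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)           # assumed f_l; the USH network uses sigmoid/tanh
    return h

def hash_code(x, weights, biases):
    # y = sgn(F(x, W)), eq. (3); code in {1, -1}^K
    return np.where(F(x, weights, biases) >= 0, 1, -1)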
2.2. Search with hashing
There are two strategies for performing a search with hashing: hash code ranking and hash table lookup [19]. In hash code ranking, an exhaustive search is performed by comparing the distance (e.g., the Hamming distance) between the query and the reference items. The items with the smallest distances, called nearest neighbors, are retrieved. However, the cost of computing the distances degrades performance. The alternative approach, hash table lookup, aims to accelerate the search by reducing the number of distance computations. The inverse lookup database, called a hash table, is composed of buckets indexed by the hash codes. Given the query, the matching items stored in the corresponding bucket are retrieved.
2.3. Audio fingerprinting
Audio fingerprinting is best known for its ability to identify an unknown audio recording using a compact content-based signature, the so-called fingerprint [21]. It does this by converting the audio features into hash codes, aiming to uniquely identify an audio recording. The advantage of a fingerprint is that it reduces storage costs, since a fingerprint is relatively small. Moreover, perceptual irrelevancies have been removed from the fingerprint, resulting in efficient comparison and searching.
3. METHOD
The aim of this paper is to provide a technique that yields good retrieval quality together with computational efficiency. In this work, compact binary codes are learned for fingerprint indexing with an unsupervised deep network in such a way that the similarity between samples is preserved. Once a short audio sample is submitted to our content-based audio retrieval system, the system performs a database lookup for the matching track and then returns the song ID from which the query was taken. As shown in Figure 3, the system is designed with three steps: (1) fingerprint feature extraction, (2) unsupervised similarity-preserving hashing, and (3) sequence matching.
Figure 3. The construction of the proposed method for our content-based audio retrieval system: fingerprint features [f1, Δf, Δt] are extracted from the reference and query audio, mapped to 16-bit hash codes by the deep learning to hash network (layers of 20, 19, 18, 16, and 20 nodes), stored in the items database, and resolved by sequence matching to return the song ID
3.1. Fingerprint feature extraction
Before fingerprint feature extraction is performed, the audio signal is converted into a common format for analysis. Next, the time-series audio signal is converted into the time-frequency domain, from which more meaningful information can be extracted. Each aspect is detailed below.
3.1.1. Preprocessing and transform
In this paper, the fingerprint extraction presented in [3] is applied. First, we convert the input audio to a mono signal and downsample it from the standard digital audio rate of 44.1 kHz to 8 kHz, making the data easier to handle, reducing the database size, and increasing the speed of the algorithm. The audio signal is then converted into a time-frequency representation. We perform a short-time Fourier transform (STFT) with a window size of 64 ms for good spectral resolution [22] and a hop size of 32 ms. Figure 4 shows the resulting time-frequency graph, the so-called spectrogram: the horizontal axis is time, the vertical axis is frequency, and the third dimension is intensity. Each point on the graph represents the intensity of a given frequency at a specific time.
Figure 4. Spectrogram with peak intensities
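A sketch of this preprocessing and transform using SciPy; the 80/441 resampling ratio realizes 44.1 kHz → 8 kHz, and the 512-sample window with a 256-sample hop corresponds to 64 ms and 32 ms at 8 kHz. The function and parameter names are illustrative.

import numpy as np
from scipy.signal import resample_poly, stft

def preprocess(audio_stereo_441k):
    # Stereo 44.1 kHz -> mono 8 kHz magnitude spectrogram
    mono = audio_stereo_441k.mean(axis=1)       # average the channels to mono
    x = resample_poly(mono, up=80, down=441)    # 44100 * 80/441 = 8000 Hz
    f, t, Z = stft(x, fs=8000, nperseg=512, noverlap=256)  # 64 ms window, 32 ms hop
    return f, t, np.abs(Z)                       # frequencies, times, magnitudes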
3.1.2. Feature extraction
After converting the signal into the time-frequency domain, the features are extracted from the spectrum. Due to their robustness to noise and distortion, the amplitude peaks in each frame are selected as candidate points. Each candidate point is paired with adjacent peaks. The constellation map of paired points with its coordinate list is shown in Figure 5. In this work, each candidate point is paired within a target region of 31 frequency bins and 63 time frames, and only the 3 peaks closest in time are selected. Figure 6 shows the combinatorial association of a pair of two points, which is called a 'landmark'. Each pair consists of four components: the starting frequency $f_1$, the starting time $t_1$, the end frequency $f_2$, and the end time $t_2$.
Figure 5. A constellation map of paired points

Figure 6. The combinatorial association of a pair of two points: a candidate point (t1, f1) is paired with a point (t2, f2) inside a target region of 31 frequency bins and 63 time frames, giving Δf = f2 − f1, Δt = t2 − t1, and feature = [f1, Δf, Δt]
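A sketch of the peak picking and pairing described above; the limits of 31 frequency bins, 63 time frames, and 3 pairs per candidate point follow the text, while the 9x9 local-maximum neighborhood is an assumption.

import numpy as np
from scipy.ndimage import maximum_filter

def landmarks(S, n_pairs=3, df_max=31, dt_max=63):
    # Extract [f1, df, dt] landmark features from a magnitude spectrogram S (freq x time)
    peaks = (S == maximum_filter(S, size=(9, 9))) & (S > 0)   # assumed neighborhood size
    f_idx, t_idx = np.nonzero(peaks)
    order = np.argsort(t_idx)                                  # scan candidate points in time order
    f_idx, t_idx = f_idx[order], t_idx[order]
    feats = []
    for i in range(len(t_idx)):
        paired = 0
        for j in range(i + 1, len(t_idx)):                     # nearest peaks in time first
            dt, df = t_idx[j] - t_idx[i], f_idx[j] - f_idx[i]
            if dt > dt_max:
                break
            if dt > 0 and abs(df) <= df_max:
                feats.append((f_idx[i], df, dt, t_idx[i]))     # feature plus anchor time t1
                paired += 1
                if paired == n_pairs:
                    break
    return feats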
3.1.3. Audio fingerprint
For the landmark mentioned above, the audio fingerprint can be defined as follows:

$\mathrm{Fingerprint} = [f_1, \Delta f, \Delta t]$  (5)

where the frequency difference is $\Delta f = f_2 - f_1$ and the time difference between the two points is $\Delta t = t_2 - t_1$. The fingerprint is also associated with the offset time from the beginning of the audio file to the starting time $t_1$.
The fingerprint feature $[f_1, \Delta f, \Delta t]$ is used to generate the hash code in [3]. The hash model can be defined as shown in (6):

$f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$  (6)

where the fingerprint hash is composed of an 8-bit frequency $f_1$, a 6-bit frequency difference $\Delta f$, and a 6-bit time difference $\Delta t$. Figure 7 shows an example of a 20-bit hash address calculated from (6).
Figure 7. An example of a 20-bit hash address: the feature [f1, Δf, Δt] = [216, 18, 36] maps to the hash address 216 × 2^12 + 18 × 2^6 + 36 = 885924 (binary 11011000 010010 100100)
A 16-bit fingerprint hash is composed of a 6-bit frequency $f_1$, a 5-bit frequency difference $\Delta f$, and a 5-bit time difference $\Delta t$. The hash model can be defined as shown in (7):

$f_1 \times 2^{10} + \Delta f \times 2^{5} + \Delta t$  (7)
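The packings in (6) and (7) amount to integer bit operations; in this sketch, masking each component to its field width is added as an assumption to keep the components inside their bit budgets.

def hash20(f1, df, dt):
    # 20-bit hash of (6): 8-bit f1, 6-bit df, 6-bit dt
    return ((f1 & 0xFF) << 12) | ((df & 0x3F) << 6) | (dt & 0x3F)

def hash16(f1, df, dt):
    # 16-bit hash of (7): 6-bit f1, 5-bit df, 5-bit dt
    return ((f1 & 0x3F) << 10) | ((df & 0x1F) << 5) | (dt & 0x1F)

assert hash20(216, 18, 36) == 885924     # reproduces the example of Figure 7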
After the hash code is calculated, the system then uses this code as an index for searching in
the database. An exact matching algorithm is applied in [3]. Unlike the Shazam algorithm, we develop a deep
neural network with multiple hierarchical layers of nonlinear and linear transformations to learn compact
codes from these fingerprint features such that the similarity between samples is preserved. The details are
described further in the next section.
3.2. Unsupervised similarity-preserving hashing (USH)
In this paper, the hash transformations are created by an unsupervised deep neural network. As shown in Figure 8, there are five layers in our deep network: the input layer consists of 20 nodes for the input $x_i$; the three hidden layers consist of 19, 18, and 16 nodes, respectively; and the output layer consists of 20 nodes for the reconstruction $\hat{x}_i$.
Figure 8. Our proposed unsupervised similarity-preserving hashing network (USH): five layers of 20, 19, 18, 16, and 20 nodes; layer 1 passes the inputs through, layers 2 and 3 use the sigmoid activation, layer 4 uses the hyperbolic tangent, and layer 5 is linear
Our deep network is trained so that the output of the fourth layer can be used as the binary hash codes. In the network design, each node is composed of one input summation function and one output transformation function. The function $f(\cdot)$ combines the information arriving on the links from other nodes, as shown in (8):

$\mathrm{node}_{in} = f(x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}; w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n})$  (8)

where $x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}$ are the inputs to the node, $w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n}$ are the associated weights, $l$ indicates the layer number, and $n$ is the number of input nodes. The output activation function $a(f(\cdot))$ is shown in (9):

$\mathrm{node}_{out} = a_{(l)}(\mathrm{node}_{in}) = a_{(l)}(f(\cdot))$  (9)
Let $H_i$ be the sum of products of the inputs $x_i$ and weights $w_i$, $H_i = \sum w_i x_i$. The functions of the nodes from layer 1 to layer 5 of our proposed network are defined as follows:

Layer 1: In this layer, the nodes only convey the inputs to the nodes of the next layer. The functions of the $i$th node are:

$f_{(1),i} = x_{(1),i}$ and $a_{(1),i} = f_{(1),i}$  (10)

Layer 2: In this layer, the sigmoid function is used as the activation function. Let $x_{(2),i} = a_{(1),i}$ and $H_{(2),i} = \sum w_{(2),i} x_{(2),i}$. The functions of the $i$th node are defined as:

$f_{(2),i} = \frac{1}{1 + e^{-H_{(2),i}}}$ and $a_{(2),i} = f_{(2),i}$  (11)

Layer 3: In this layer, the sigmoid function is again used as the activation function. Let $x_{(3),i} = a_{(2),i}$ and $H_{(3),i} = \sum w_{(3),i} x_{(3),i}$. The functions of the $i$th node are defined as:

$f_{(3),i} = \frac{1}{1 + e^{-H_{(3),i}}}$ and $a_{(3),i} = f_{(3),i}$  (12)

Layer 4: The output of each node in this layer will be used as the binary codes. During training, these codes are used to reconstruct the input data at the output layer. The hyperbolic tangent function is used as the activation function in this layer. Let $x_{(4),i} = a_{(3),i}$ and $H_{(4),i} = \sum w_{(4),i} x_{(4),i}$. The functions of the $i$th node are defined as:

$f_{(4),i} = \frac{e^{H_{(4),i}} - e^{-H_{(4),i}}}{e^{H_{(4),i}} + e^{-H_{(4),i}}}$ and $a_{(4),i} = f_{(4),i}$  (13)

Layer 5: This layer is the output or reconstruction layer. To preserve the similarity between samples, the target outputs are set equal to the inputs of layer 1. Let $x_{(5),i} = a_{(4),i}$ and $H_{(5),i} = \sum w_{(5),i} x_{(5),i}$. The functions of the $i$th output node are defined as:

$f_{(5),i} = H_{(5),i}$ and $a_{(5),i} = f_{(5),i}$  (14)
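A sketch of a forward pass through the five layers with the stated activations (identity, sigmoid, sigmoid, tanh, linear); the 20-19-18-16-20 shapes follow Figure 8, while the random weight initialization is only for illustration.

import numpy as np

def sigmoid(H):
    return 1.0 / (1.0 + np.exp(-H))

sizes = [20, 19, 18, 16, 20]                       # layers 1..5
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]

def ush_forward(x):
    # Returns the 16-node code-layer output and the 20-node reconstruction
    a = x                                          # layer 1: identity, eq. (10)
    a = sigmoid(Ws[0] @ a + bs[0])                 # layer 2, eq. (11)
    a = sigmoid(Ws[1] @ a + bs[1])                 # layer 3, eq. (12)
    code = np.tanh(Ws[2] @ a + bs[2])              # layer 4: code layer, eq. (13)
    recon = Ws[3] @ code + bs[3]                   # layer 5: linear output, eq. (14)
    return code, recon

code, recon = ush_forward(rng.normal(size=(20, 1)))
y = np.where(code >= 0, 1, -1)                     # binarized 16-bit hash code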
To achieve efficient binary codes, we include constraints in the objective function so that the codes have four properties: (1) belonging to $\{1, -1\}$, (2) similarity preserving, (3) independence, and (4) balance. In this paper, the method presented in UH-BDNN [23] is applied to optimize the objective function, which is defined as follows:

$\min_{W,b} Loss = \frac{1}{2N}\|X - (W^{(L-1)}Y + b^{(L-1)}\mathbf{1}_{1 \times N})\|^2 + \frac{\lambda_1}{2}\sum_{l=1}^{L-1}\|W^{(l)}\|^2 + \frac{\lambda_2}{2N}\|H^{(L-1)} - Y\|^2 + \frac{\lambda_3}{2}\left\|\frac{1}{N}H^{(L-1)}(H^{(L-1)})^T - I\right\|^2 + \frac{\lambda_4}{2N}\|H^{(L-1)}\mathbf{1}_{N \times 1}\|^2$  (15)

$\text{s.t. } Y \in \{1, -1\}^{K \times N}$  (16)
where $X \in \mathbb{R}^{D \times N}$ is a set of $N$ training samples of dimension $D$, $Y \in \{1, -1\}^{K \times N}$ is the output binary code of $X$, $K$ is the number of bits, $L$ is the number of layers, $W^{(l)}$ is the weight matrix between layer $l+1$ and layer $l$, $b^{(l)}$ is the bias vector for the nodes in layer $l+1$, $H^{(l)} = f^{(l)}(W^{(l-1)}H^{(l-1)} + b^{(l-1)}\mathbf{1}_{1 \times N})$ is the output of layer $l$ with $H^{(1)} = X$, $f^{(l)}$ is the activation function of layer $l$, and $\lambda_1$-$\lambda_4$ are the parameters of the objective function.
The first term of (15) ensures that the binary code allows a good reconstruction of $X$. The second term is a weight regularization that encourages the network to keep the weights small in order to reduce overfitting. The third term measures the violation of the equality constraint between $H^{(L-1)}$ and $Y$. The fourth term encourages independence of the binary codes, and the fifth term encourages their balance. The constraint (16) ensures that each bit of the binary codes belongs to $\{1, -1\}$. After the codes are produced by the deep network, they are used as the search index in our content-based audio retrieval system. The song ID, $t_1$, $f_1$, $\Delta f$, and $\Delta t$ are stored at their hash address in the database. Table 1 shows the representation of the information data.
Table 1. Representation of information data
Method       Index                          Information data
USH          16-bit hash address            Song ID, t1, f1, Δf, Δt
Shazam [3]   16-bit / 20-bit hash address   Song ID, t1
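A sketch of evaluating the objective (15) in NumPy for given weights and codes; the argument layout, with H the code-layer output H^(L-1) and W_rec, b_rec the reconstruction weights W^(L-1), b^(L-1), is an assumption for illustration.

import numpy as np

def ush_loss(X, Y, H, W_list, W_rec, b_rec, lam=(1e-3, 1e-2, 1e-2, 1e-2)):
    # X: (D, N) data; Y, H: (K, N) with Y in {1, -1}; lam: lambda_1..lambda_4
    l1, l2, l3, l4 = lam
    N, K = X.shape[1], H.shape[0]
    recon = X - (W_rec @ Y + b_rec)                          # b_rec broadcasts over N columns
    loss = np.sum(recon ** 2) / (2 * N)                      # reconstruction term
    loss += l1 / 2 * sum(np.sum(W ** 2) for W in W_list)     # weight regularization
    loss += l2 / (2 * N) * np.sum((H - Y) ** 2)              # H ~ Y constraint violation
    loss += l3 / 2 * np.sum((H @ H.T / N - np.eye(K)) ** 2)  # independence term
    loss += l4 / (2 * N) * np.sum((H @ np.ones((N, 1))) ** 2)  # balance term
    return loss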
3.3. Sequence matching
In the query step, a sequence of query features is converted into a set of compact hash codes used for searching the inverse lookup database. Let $Q$ represent a sequence of query features, $Q = \{q_1, q_2, \ldots, q_M\}$, where $q_m$ is the query at order $m$, $m = 1, 2, \ldots, M$, and $M$ is the total number of queries in the sequence. The learned hash function $H: q_m \to h(q_m)$ is used to map the query features to binary hash codes. We can map $Q = \{q_m\}_{m=1}^{M}$ to the corresponding binary codes as follows:

$Y = H(Q) = \{h(q_1), h(q_2), \ldots, h(q_M)\}$  (17)

where $Y \in \{1, -1\}^{K \times M}$ is the hash codes of $Q$ and $K$ is the number of bits. After learning the deep network, we obtain a set of items indexed by each hash address. Let $s_m = \{x_{m1}, x_{m2}, \ldots, x_{mn_m}\}$ be the set of items retrieved for $q_m$, where $x_{mi} \in \mathbb{R}^{5 \times 1}$ is an information vector stored in the database and $n_m$ is the number of items for $q_m$. Given $S = \{s_m\}_{m=1}^{M}$, the set of all $s_m$ for $m = 1, 2, \ldots, M$, the sequence matching process is shown in Figure 9.
Figure 9. Our proposed sequence matching: a sequence of queries Q = {q1, q2, q3} is mapped by the learned hash function to hash addresses in the reference inverse lookup database; the matching items (information data [Song ID, t1, f1, Δf, Δt]) are collected into S, and the song ID is returned from the items with the minimum distance and the most frequently occurring similar relative time offsets
As can be seen in Figure 9, assume that $Q = \{q_1, q_2, q_3\}$; given the binary codes $H(Q)$, we obtain the sets $s_1 = \{x_{11}, x_{12}, x_{13}, x_{14}\}$, $s_2 = \{x_{21}, x_{22}\}$, $s_3 = \{x_{31}, x_{32}, x_{33}\}$, and $S = \{x_{11}, x_{12}, x_{13}, x_{14}, x_{21}, x_{22}, x_{31}, x_{32}, x_{33}\}$. The one-nearest-neighbor search $NN(q_m)$ for the query item at order $m$ within $s_m$ is defined as follows:

$NN(q_m) = \arg\min_{x_{mi} \in s_m} \|x_{mi} - q_m\|_2$  (18)

where $\|x_{mi} - q_m\|_2$ is the $L_2$-norm between $x_{mi}$ and the query $q_m$, and $R_Q = \{NN(q_m)\}_{m=1}^{M}$ is the candidate item set of $Q$. We also apply a time offset constraint to improve the accuracy of the sequence matching. The constraint requires the absolute difference $|T_{x_m} - T_{q_m}|$ to be consistent across the sequence:

$|T_{x_1} - T_{q_1}| = \cdots = |T_{x_m} - T_{q_m}|$  (19)

where $T_{x_m}$ and $T_{q_m}$ are the offset times of the reference file $x_m$ and the query $q_m$, respectively. In other words, a sequence of candidate items should occur with the same absolute time difference across the sequence. Our proposed method follows the procedure below.
In summary, the proposed audio retrieval algorithm is based on two parts: the similarity (minimum distance) between the audio query and the songs in the reference database, and the absolute difference among the time sequences.
Algorithm: Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks

Input:  M = {m_i}_{i=1}^{N}, a reference set of N songs; Q, a query set
Output: SID, the song ID returned by the proposed algorithm

Step 1: Extract fingerprint features X = {x_i}_{i=1}^{N} ∈ R^{D×N} from the reference dataset.
Step 2: Learn the hash, where x_i ∈ R^D is an input vector and the objective function is defined
        as in (15)-(16). The learned function h(q) is applied to each audio fingerprint q in step 3.
Step 3: Sequence matching
  1. Divide the query sample into M fingerprints Q = {q_m}_{m=1}^{M}.
  2. s_i = {}, A = {}, S = {}, R_Q = {}
     for i = 1, 2, ..., M do
         index_i = h(q_i)
         s_i = all items in A(index_i)
         S = S ∪ s_i
     end
     for i = 1, 2, ..., M do
         Aux = maxValue
         for j = 1, 2, ..., sizeOf(s_i) do
             if ||x_j − q_i||_2 < Aux then
                 Aux = ||x_j − q_i||_2
                 absTime = |T_{x_j} − T_{q_i}|
                 SID = SID_{x_j}
             end
         end
         R_Q = R_Q ∪ {SID, absTime}
     end
  3. Find the song ID with the maximum frequency among the entries of R_Q that share the same
     absolute difference of time offsets.
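A sketch of this sequence matching in plain Python; the database layout (a dict from hash address to lists of (song_id, t1, f1, df, dt) tuples) and the joint vote over (song ID, time offset) pairs are illustrative assumptions consistent with the algorithm above.

import numpy as np
from collections import Counter

def match_sequence(queries, database, h):
    # queries: list of (feat, t_q) with feat = [f1, df, dt]; h: learned hash function
    votes = Counter()
    for feat, t_q in queries:
        best, best_dist = None, float("inf")
        for song_id, t1, f1, df, dt in database.get(h(feat), []):  # bucket s_i
            dist = np.linalg.norm(np.array([f1, df, dt], float) - np.asarray(feat, float))
            if dist < best_dist:                  # 1-NN within the bucket, eq. (18)
                best_dist = dist
                best = (song_id, abs(t1 - t_q))   # candidate with its absolute time offset
        if best is not None:
            votes[best] += 1                      # identical (song, offset) pairs accumulate, eq. (19)
    if not votes:
        return None
    (song_id, _offset), _count = votes.most_common(1)[0]
    return song_id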
4. EXPERIMENTAL AND PERFORMANCE ANALYSIS
4.1. Database
The performance of our proposed USH method is evaluated on the Extended Ballroom dataset, freely available in [24, 25]. The dataset consists of 4,180 musical excerpts of 13 genres with a length of 30 seconds each. The audio quality of the data is 44.1 kHz, 192 kbps, stereo, MP3 format. In this work, the audio signal is downsampled to 8 kHz to make the data easier to handle, as previously mentioned. The training set (also used as the reference database for retrieval) is composed of 3,000 tracks from 8 genres, the same rhythm classes as in our previous works [26, 27]. A set of 1,000 audio queries with a length of 10 seconds each is randomly selected from those 3,000 tracks. Another set of 200 audio queries comes from audio files that do not appear in the database, in order to analyze the false positive rate. Each audio sample is represented by a 20-dimensional feature vector extracted by the fingerprint algorithm. Table 2 shows the number of samples in the database and the query set.
Table 2. Audio samples in the database and query set
Set        Number of tracks   Length of segment (s)   Number of samples
Database   3,000              30                      441,184
Query      1,200              10                      67,785
4.2. Performance evaluation
4.2.1. Effectiveness of retrieval
On a total of 1,200 audio queries, the retrieval results obtained with our proposed USH method and with the state-of-the-art data-independent method, the Shazam algorithm, are shown in Table 3. A false negative (FN) refers to the incorrect identification that the query audio does not exist in the database when it does; a true positive (TP) refers to the correct identification of the audio recording from the query; a false positive (FP) refers to the incorrect identification of a wrong recording when the correct recording does not exist in the database; and a true negative (TN) refers to the correct identification that no audio recording matches the query. According to the experimental results, we obtain a higher accuracy for the proposed USH (88.92%) than for the state-of-the-art data-independent method, the Shazam algorithm (71.67% for the 16-bit hash code, 87.42% for the 20-bit hash code). Figure 10 shows the retrieval accuracy comparison between the two methods.
Table 3. Retrieval results comparison between USH and the state-of-the-art data-independent method
Method          FN    TP    FP    TN
USH 16-bit      114   886   19    181
Shazam 16-bit   290   710   50    150
Shazam 20-bit   132   868   19    181
Figure 10. The retrieval accuracy of the proposed USH and the Shazam algorithm
The effectiveness of the USH is evaluated through the experiments and compared with the state-of-the-art data-independent method in terms of precision, recall, F1 score, and false positive rate [28], as follows:

$Precision = \frac{TP}{TP + FP} \times 100$  (20)

$Recall = \frac{TP}{TP + FN} \times 100$  (21)

$F1\,score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$  (22)

$False\ Positive\ Rate = \frac{FP}{FP + TN} \times 100$  (23)
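These four measures follow directly from the confusion counts; the sketch below reproduces the USH row of Table 4 from the counts in Table 3.

def retrieval_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn) * 100
    return precision, recall, f1, fpr

print(retrieval_metrics(tp=886, fp=19, fn=114, tn=181))
# -> approximately (97.90, 88.60, 93.02, 9.50), matching Table 4 for USH 16-bit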
As can be seen in Table 4, we obtain higher precision and recall values for the proposed USH than for the state-of-the-art data-independent method with both 16-bit and 20-bit hash codes. The F1 score (in the fourth column) shows the overall effectiveness of the two methods. Furthermore, the USH has a significantly lower percentage of false positives (9.50%) than the state-of-the-art data-independent method (25.00%) for the same 16-bit hash code. This shows the superior performance of the USH on short codes.
Table 4. Effectiveness comparison between USH and the state-of-the-art data-independent method
Method          Precision   Recall   F1 score   % False positive
USH 16-bit      97.90       88.60    93.02      9.50
Shazam 16-bit   93.42       71.00    80.68      25.00
Shazam 20-bit   97.86       86.80    92.00      9.50
4.2.2. Storage cost
As a result of our proposed USH method, the 20-bit fingerprint features can be mapped into 16-bit binary codes. Hence, the hash index is reduced by a factor of 16 (2^20 versus 2^16 possible hash addresses), significantly reducing the database size and yielding higher search performance.
4.3. Discussion
The performance of the proposed USH for large audio database retrieval is evaluated through the experiments and compared with the state-of-the-art data-independent hashing method, the Shazam algorithm, on a test set of 3,000 audio recordings. The experimental results support the effectiveness of the USH, with high precision and recall values of 97.90% and 88.60%, respectively. These satisfactory results arise because the hash codes produced by our proposed method have the similarity-preserving property, i.e., similar items are mapped to the same hash code while dissimilar items are mapped to different ones. The data-independent methods do not take this property into account.
The Shazam algorithm has a higher percentage of false positives (25.00%) than the USH (9.50%) for the same 16-bit hash code. This shows that the Shazam algorithm is more likely to give incorrect identifications for short codes and therefore has inferior performance in audio retrieval. Furthermore, if the database size increases substantially in the future, it is likely that the Shazam algorithm would produce a significant number of false positive matches.
Let us consider the collection S of the USH and of the Shazam algorithm, which affects the accuracy of audio retrieval. For the Shazam algorithm, the collection S consists of the data items for which the search algorithm finds only exact matches to the search queries, regardless of similarity preservation, and this may result in the loss of relevant data. The collection S of our proposed USH method, by contrast, consists of candidate data items for which the search algorithm focuses on the similarity between the search queries and the items in the database. As shown in Table 5, with the collection S of the USH, Song ID 48 is correctly identified by the two data items with the smallest distance (distance = 1) at the same time offset. With the Shazam algorithm, the relevant song cannot be retrieved.
Table 5. Example of the collection S of the proposed USH and the Shazam algorithm

Method          Song ID   Time offset   Distance   Number of item(s)
USH 16-bit      150       128           16         2
                1041      239           28         2
                936       -232          32         2
                48*       152           1          2
                306       228           48         2
                2453      516           24         2
                690       3720          32         1
                690       2726          32         1
Shazam 16-bit   2732      164           -          2
                847       19            -          2
                684       2871          -          1
                674       4475          -          1
                675       2584          -          1
                676       1564          -          1
                676       4739          -          1
                677       1820          -          1
Shazam 20-bit   1174      52            -          1
                1176      557           -          1
                1230      218           -          1
                706       363           -          1
                340       395           -          1
                48        153           -          1
                93        448           -          1
                139       453           -          1

* Indicates that the system correctly identified the audio recording
The major factors behind this superiority are 1) the similarity-preserving hash codes produced by our proposed USH method, and 2) the proposed audio retrieval algorithm, which combines two criteria: the $L_2$-norm and the absolute time offset difference. These factors increase the ability to identify candidate items according to the similarity between the audio sample and the songs in the reference database, and this significantly improves the retrieval performance.
5. CONCLUSION
In this paper, an unsupervised similarity-preserving hashing (USH) method for content-based audio retrieval is proposed. We develop a deep network with multiple hierarchical layers of nonlinear and linear transformations to learn compact hash codes in which the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. The experimental results on the Extended Ballroom dataset show the superiority of our proposed method over the state-of-the-art data-independent method. Future work should focus on extending USH to supervised hashing by leveraging semantic labels to enhance the retrieval performance.
REFERENCES
[1] P. Grosche, M. Müller, and J. Serrà, “Audio Content-Based Music Retrieval,” Multimodal Music Processing.
Dagstuhl Follow-Ups, vol. 3, pp. 157-174, 2012.
[2] J. Haitsma, and T. Kalker, “A highly robust audio fingerprinting system with an efficient search strategy,” Journal
of New Music Research, vol. 32, no. 2, pp. 211-222, 2003.
[3] A. L. Wang, “An Industrial-Strength Audio Search Algorithm,” 4th International Conference on Music
Information Retrieval (ISMIR 2003), pp. 7-13, 2003.
[4] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, “Time-Frequency Ratio Hashing for Content-Based
Audio Retrieval,” 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi,
pp. 205-210, 2017.
[5] A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” 25th International
Conference on Very Large Data Bases (VLDB’99), pp. 518-529, 1999.
[6] B. Kulis, P. Jain, and K. Grauman, “Fast Similarity Search for Learned Metrics,” In: IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.
[7] M. Raginsky, and S. Lazebnik, “Locality-Sensitive Binary Codes from Shift-Invariant Kernels,” 23rd Annual
Conference on Neural Information Processing Systems (NIPS’09), pp. 1509-1517, 2009.
[8] Y. Zheng, J. Zhu, W. Fang, and L.-H. Chi, “Deep Learning Hash for Wireless Multimedia Image Content Security,” Security and Communication Networks, vol. 2018, pp. 1-13, 2018.
[9] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, “Learning to Hash for Indexing Big Data-A Survey,” Proceedings of
the IEEE, vol. 104, no. 1, pp. 34-57, 2016.
[10] Y. Weiss, A. Torralba, and R. Fergus, “Spectral Hashing,” 21st International Conference on Neural Information
Processing Systems (NIPS’08), pp. 1753-1760, 2008.
[11] B. Kulis, and K. Grauman, “Kernelized Locality-Sensitive Hashing for Scalable Image Search,” 2009 IEEE 12th
International Conference on Computer Vision (ICCV), Kyoto, pp. 2130-2137, 2009.
[12] Y. Gong, and S. Lazebnik, “Iterative Quantization: A Procrustean Approach to Learning Binary Codes,” IEEE
Conference on Computer Vision and Pattern recognition (CVPR 2011), Providence, RI, pp. 817-824, 2011.
[13] B. Kulis, and T. Darrell, “Learning to Hash with Binary Reconstructive Embeddings,” 22nd International
Conference on Neural Information Processing Systems (NIPS’09), pp. 1042-1050, 2009.
[14] W. Liu, J. Wang, R. Ji, Y. G. Jiang, and S. F. Chang, “Supervised Hashing with Kernels,” 2012 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, pp. 2074-2081, 2012.
[15] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming Distance Metric Learning,” 25th International
Conference on Neural Information Processing Systems (NIPS’12), pp. 1061-1069, 2012.
[16] J. Wang, S. Kumar, and S. F. Chang, “Semi-Supervised Hashing for large-Scale Search,” In IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393-2406, 2012.
[17] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised Discrete Hashing,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 37-45, 2015.
[18] X. Bai, H. Yang, J. Zhou, P. Ren, and J. Cheng, “Data-dependent Hashing Based on p-Stable Distribution,” IEEE
Transactions on Image Processing, vol. 23, no. 12, pp. 5033-5046, 2014.
[19] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A Survey on Learning to Hash,” In IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 769-790, 2018.
[20] J. He, S. F. Chang, R. Radhakrishnan, and C. Bauer, “Compact hashing with joint optimization of search accuracy and time,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 753-760, 2011.
[21] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A Review of Audio Fingerprinting,” Journal of VLSI Signal
Processing, vol. 41, pp. 271-284, 2005.
[22] D. Ellis, “Robust Landmark-Based Audio Fingerprinting,” 2015. [Online]. Available: https://labrosa.ee.columbia.edu/matlab/fingerprint/.
[23] T. T. Do, A. D. Doan, and N. M. Cheung, “Learning to Hash with Binary Deep Neural Network,” 14th European
Conference on Computer Vision (ECCV 2016), pp. 219-234, 2016.
[24] U. Marchand, and G. Peeters. “Scale and shift invariant time/frequency representation using auditory statistics:
Application to rhythm description,” 2016 IEEE 26th International Workshop on Machine Learning for Signal
Processing (MLSP), Vietri sul Mare, pp. 1-6, 2016.
[25] U. Marchand, and G. Peeters, “The Extended Ballroom Dataset,” 17th International Society for Music Information
Retrieval Conference (ISMIR 2016) Late-Breaking Session, New-York, USA, pp. 1-3, 2016.
[26] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, “Unsupervised Learning Hash for Content-Based
Audio Retrieval Using Deep Neural Networks,” 2019 11th International Conference on Knowledge and Smart
Technology (KST), Phuket, Thailand, pp. 99-104, 2019.
[27] P. Panyapanuwat, and S. Kamonsantiroj, “Performance Comparison of Unsupervised Deep Hashing with Data-
independent Hashing for Content-Based Audio Retrieval,” 2019 2nd International Conference on Electronics,
Communications and Control Engineering, pp. 16-20, 2019.
[28] C. Manning, P. Raghavan, and H. Schütze, “An Introduction to Information Retrieval,” Cambridge University
Press, 2009.
BIOGRAPHIES OF AUTHORS
Petcharat Panyapanuwat holds a bachelor’s degree in mathematics and a master’s degree in software engineering. She is currently a Ph.D. candidate in the Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand. Her current research interest focuses on music information retrieval.
Suwatchai Kamonsantiroj is currently a lecturer at Department of Computer and Information
Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand.
He holds a bachelor’s degree in mechanical engineering and a master’s degree in information
technology management. He also earned his doctoral degree in computer engineering from
Kasetsart University, Thailand, graduating in 2008. His current research interests include neural
network, time series analysis, and artificial intelligence.
Luepol Pipanmaekaporn is currently a lecturer at Department of Computer and Information
Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand.
He holds both a bachelor’s and a master’s degree in computer science. He also earned his
doctoral degree in computer science from Queensland University of Technology, Australia,
graduating in 2013. His current research interests include information retrieval, web mining, and
data mining.

More Related Content

PPTX
Image to text Converter
PDF
Text Extraction from Image using Python
PDF
Image-Based Literal Node Matching for Linked Data Integration
PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
Pioneering VDT Image Compression using Block Coding
PDF
Different Steganography Methods and Performance Analysis
PDF
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
PDF
AN INNOVATIVE IDEA FOR PUBLIC KEY METHOD OF STEGANOGRAPHY
Image to text Converter
Text Extraction from Image using Python
Image-Based Literal Node Matching for Linked Data Integration
International Journal of Engineering and Science Invention (IJESI)
Pioneering VDT Image Compression using Block Coding
Different Steganography Methods and Performance Analysis
Digital Watermarking through Embedding of Encrypted and Arithmetically Compre...
AN INNOVATIVE IDEA FOR PUBLIC KEY METHOD OF STEGANOGRAPHY

What's hot (20)

PDF
Networkx tutorial
PPTX
Text extraction from images
PDF
A bidirectional text transcription of braille for odia, hindi, telugu and eng...
PPTX
A Fast and Dirty Intro to NetworkX (and D3)
PDF
Optical Character Recognition
PDF
Quality Measurements of Lossy Image Steganography Based on H-AMBTC Technique ...
PPTX
BrailleOCR: An Open Source Document to Braille Converter Application
PDF
Improvement in Traditional Set Partitioning in Hierarchical Trees (SPIHT) Alg...
PDF
An approach to hide data in video using steganography
PDF
Enhancement of DES Algorithm with Multi State Logic
PDF
Ijcnc050213
PPTX
Texture features based text extraction from images using DWT and K-means clus...
PDF
Data structure-question-bank
PDF
Self-Directing Text Detection and Removal from Images with Smoothing
PDF
Is3314841490
PDF
gilbert_iccv11_paper
PDF
Ieee a secure algorithm for image based information hiding with one-dimension...
PDF
IRJET- Image Captioning using Multimodal Embedding
PDF
Dictionary based Image Compression via Sparse Representation
PDF
Relaxing global-as-view in mediated data integration from linked data
Networkx tutorial
Text extraction from images
A bidirectional text transcription of braille for odia, hindi, telugu and eng...
A Fast and Dirty Intro to NetworkX (and D3)
Optical Character Recognition
Quality Measurements of Lossy Image Steganography Based on H-AMBTC Technique ...
BrailleOCR: An Open Source Document to Braille Converter Application
Improvement in Traditional Set Partitioning in Hierarchical Trees (SPIHT) Alg...
An approach to hide data in video using steganography
Enhancement of DES Algorithm with Multi State Logic
Ijcnc050213
Texture features based text extraction from images using DWT and K-means clus...
Data structure-question-bank
Self-Directing Text Detection and Removal from Images with Smoothing
Is3314841490
gilbert_iccv11_paper
Ieee a secure algorithm for image based information hiding with one-dimension...
IRJET- Image Captioning using Multimodal Embedding
Dictionary based Image Compression via Sparse Representation
Relaxing global-as-view in mediated data integration from linked data
Ad

Similar to Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks (20)

PPT
20140327 - Hashing Object Embedding
PDF
Supervised Quantization for Similarity Search (camera-ready)
PDF
Hash Coding
PDF
Multiview Alignment Hashing for Efficient Image Search
PPT
20140702 xu jiaming hashinglearning - lite
PDF
large_scale_search.pdf
PDF
ENTROPY OPTIMIZED FEATURE-BASED BAG-OF-WORDS REPRESENTATION FOR INFORMATION R...
PDF
A Hybrid Procreative –Discriminative Based Hashing Method
PDF
5 efficient-matching.ppt
PDF
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
PDF
Finding similar items in high dimensional spaces locality sensitive hashing
PDF
Bytewise Approximate Match: Theory, Algorithms and Applications
PDF
Chromatic Sparse Learning
PDF
[241]large scale search with polysemous codes
PPT
PDF
Probabilistic data structures. Part 4. Similarity
PDF
Binary Similarity : Theory, Algorithms and Tool Evaluation
PDF
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
PPTX
3 - Finding similar items
PPT
similarity1 (6).ppt
20140327 - Hashing Object Embedding
Supervised Quantization for Similarity Search (camera-ready)
Hash Coding
Multiview Alignment Hashing for Efficient Image Search
20140702 xu jiaming hashinglearning - lite
large_scale_search.pdf
ENTROPY OPTIMIZED FEATURE-BASED BAG-OF-WORDS REPRESENTATION FOR INFORMATION R...
A Hybrid Procreative –Discriminative Based Hashing Method
5 efficient-matching.ppt
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Finding similar items in high dimensional spaces locality sensitive hashing
Bytewise Approximate Match: Theory, Algorithms and Applications
Chromatic Sparse Learning
[241]large scale search with polysemous codes
Probabilistic data structures. Part 4. Similarity
Binary Similarity : Theory, Algorithms and Tool Evaluation
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
3 - Finding similar items
similarity1 (6).ppt
Ad

More from IJECEIAES (20)

PDF
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
PDF
Embedded machine learning-based road conditions and driving behavior monitoring
PDF
Advanced control scheme of doubly fed induction generator for wind turbine us...
PDF
Neural network optimizer of proportional-integral-differential controller par...
PDF
An improved modulation technique suitable for a three level flying capacitor ...
PDF
A review on features and methods of potential fishing zone
PDF
Electrical signal interference minimization using appropriate core material f...
PDF
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
PDF
Bibliometric analysis highlighting the role of women in addressing climate ch...
PDF
Voltage and frequency control of microgrid in presence of micro-turbine inter...
PDF
Enhancing battery system identification: nonlinear autoregressive modeling fo...
PDF
Smart grid deployment: from a bibliometric analysis to a survey
PDF
Use of analytical hierarchy process for selecting and prioritizing islanding ...
PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...

Recently uploaded (20)

PPTX
additive manufacturing of ss316l using mig welding
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Structs to JSON How Go Powers REST APIs.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Geodesy 1.pptx...............................................
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
OOP with Java - Java Introduction (Basics)
additive manufacturing of ss316l using mig welding
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Lesson 3_Tessellation.pptx finite Mathematics
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Structs to JSON How Go Powers REST APIs.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
573137875-Attendance-Management-System-original
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Embodied AI: Ushering in the Next Era of Intelligent Systems
CH1 Production IntroductoryConcepts.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
UNIT 4 Total Quality Management .pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Geodesy 1.pptx...............................................
Arduino robotics embedded978-1-4302-3184-4.pdf
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
OOP with Java - Java Introduction (Basics)

Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 11, No. 1, February 2021, pp. 879~891 ISSN: 2088-8708, DOI: 10.11591/ijece.v11i1.pp879-891  879 Journal homepage: http://ijece.iaescore.com Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks Petcharat Panyapanuwat, Suwatchai Kamonsantiroj, Luepol Pipanmaekaporn Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Thailand Article Info ABSTRACT Article history: Received Jan 1, 2020 Revised Jun 8, 2020 Accepted Aug 18, 2020 Due to its efficiency in storage and search speed, binary hashing has become an attractive approach for a large audio database search. However, most existing hashing-based methods focus on data-independent scheme where random linear projections or some arithmetic expression are used to construct hash functions. Hence, the binary codes do not preserve the similarity and may degrade the search performance. In this paper, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. Different from data-independent hashing methods, we develop a deep network to learn compact binary codes from multiple hierarchical layers of nonlinear and linear transformations such that the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. Experimental results on the Extended Ballroom dataset with 8 genres of 3,000 musical excerpts show that our proposed method significantly outperforms state-of- the-art data-independent method in both effectiveness and efficiency. Keywords: Content-based audio retrieval Deep learning Deep neural networks Similarity-preserving hash Unsupervised learning This is an open access article under the CC BY-SA license. Corresponding Author: Petcharat Panyapanuwat, Department of Computer and Information Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand. Email: panyapetch@hotmail.com 1. INTRODUCTION With rapidly growing database of digital audio recordings, the novel retrieval strategies have received great attention. Early retrieval approach uses textual metadata describing the content of music audio (e.g., artist name, song title, album name, genre, or release year of music). In case such descriptions are not available, it is required content-based retrieval strategy that the perceptual aspects of the audio are utilized. [1]. Content-based audio retrieval approach is generally solved with two steps: first, features are extracted from the audio file and then used to build indexes for searching. Two main issues of performing a search over a large database are search speed and efficient storage. The most interesting approach for handling these problems is binary hashing, where the high-dimensional features are encoded into compact binary codes. There have been several hashing methods proposed in the literature. They can be devided into two categories, data-independent methods and data-dependent methods. Methods in data-independent category [2-7] use random linear projections or some arithmetic expression to construct hash functions. Without the training process, they are robust to data variation. However, such methods require long hash codes to achieve high precision. This increases the storage cost and degrades the search efficiency [8]. 
Methods in data-dependent category, also called learning to hash methods, aim to learn a set of hash functions from available training data that yield compact codes to achieve satisfactory search performance [9]. Existing data-dependent methods can be classified into unsupervised, supervised, and semi-supervised
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 11, No. 1, February 2021 : 879 - 891 880 learning approach. Unsupervised hashing methods [10-12] use unlabeled data to build the hash functions where the neighbor distance (e.g., L2 norm) among the training data is preserved. Supervised or semi- supervised hashing methods [13-17] attemp to improve the quality of hashing by leveraging the semantic labels into the learning process. Compared with data-independent methods, it appears that data-dependent methods can achieve better accuracy with shorter codes [12, 14, 17]. However, data-dependent methods may be too dependent on the training data [18]. There are both advantages and shortcomings of using data-independent and data-dependent methods. However, the previous works of the two categories do not fully take into consideration the similarity preserving and this may degrade the retrieval performance. In this work, an unsupervised similarity-preserving hashing method for content-based audio retrieval is proposed. We develop a deep network with several hierarchical layers of nonlinear and linear transformations to learn compact binary codes where the similarity between samples is preserved. Furthermore, the independence and balance properties are included in the objective function to improve the codes. The proposed method is compared with the Shazam algorithm [3], the data-independent hashing method, in terms of accuracy, precision, recall, false positive rate, and the storage cost. 2. BACKGROUND 2.1. Learning to hash Learning to hash attempts to learn a hash function 𝑦 = ℎ(𝑥) that maps a high-dimensional input item 𝑥 ∈ 𝑅𝐷 to a compact code 𝑦, aiming to improve the search performance [19]. There are 4 topics to consider for the learning to hash: (1) hash function, (2) similarity-preserving, (3) loss function, and (4) deep learning to hash. 2.1.1. Hash function There are several ways to design hash functions. The most widely used hash functions are generalized by linear projection as shown in (1). 𝑦 = ℎ(𝑋) = sgn(𝑓(𝑊𝑇 𝑋 + 𝑏))   where 𝑦 ∈ {0,1} or {−1,1}, 𝑋 = {x𝑛}𝑛=1 𝑁 ∈ 𝑅𝐷x𝑁 is the training set which contains 𝑁 samples, 𝐷 is the dimension of input vector, 𝑊 = {𝑤𝑘}𝑘=1 𝐾 ∈ 𝑅𝐷x𝐾 is the projection vector, 𝐾 is number of hash bits, 𝑏 is the bias variable, sgn(𝑧) = −1 or 0 if 𝑧 < 0 and sgn(𝑧) = 1 otherwise, 𝑓(∙) is a predefined function which can possibly be neural networks or nonlinear function. However, using different 𝑓(∙) yields different hash function properties. 2.1.2. Similarity-preserving The distance 𝑑𝑖𝑗 between two items 𝑥𝑖 and 𝑥𝑗 can be defined by the standardized Euclidean distance ‖𝑥𝑖 − 𝑥𝑗‖ 2 or others. The similarity 𝑠𝑖𝑗 between those items is often defined as a function of the distance 𝑑𝑖𝑗 (e.g., Gaussian function, cosine similarity, and so on). In addition, the semantic similarity approach is generally used in similarity search application. We can apply any distance to the hashing algorithm for semantic similarity, such as Euclidean distance, by defining semantic similarity 𝑠𝑖𝑗 = 1 for adjacent points and 𝑠𝑖𝑗 = 0 or −1 for farther points. In the hash coding space, the Hamming distance 𝑑𝑖𝑗 𝐻 between the code 𝑦𝑖 and 𝑦𝑗 can be defined as ‖𝑦𝑖 − 𝑦𝑗‖ 1 = ∑ ‖ℎ𝑘(𝑥𝑖) − ℎ𝑘(𝑥𝑗)‖ 𝐾 𝑘=1 . It is the number of binary digits where the values are different. Hamming similarity is defined as 𝑠𝑖𝑗 𝐻 = 𝐾 − 𝑑𝑖𝑗 𝐻 for the codes valued by 1 and 0. For the codes valued by 1 and -1, the inner product 𝑠𝑖𝑗 𝐻 = 𝑦𝑖 𝑇 𝑦𝑗 is defined as the similarity. Let’s focus on the term of similarity preserving. 
Let us now focus on similarity preserving itself. In Figure 1(a), there is a set of three points ($x_1$, $x_2$, and $x_3$) in an input space. By measuring the Euclidean distance between the points, we find that $x_1$ is closer to $x_2$ than to $x_3$, i.e., $x_1$ is more similar to $x_2$ than to $x_3$. The codes $h(x_1)$, $h(x_2)$, and $h(x_3)$ are the representations of $x_1$, $x_2$, and $x_3$ in the hash coding space (or Hamming space), respectively. From Figure 1(b), we can see that $h(x_1)$ is closer to $h(x_3)$ while $h(x_2)$ is far away; in this case, the similarities are not preserved. Figure 1(c), on the other hand, shows an example in which the similarities are well preserved.
Figure 1. Similarity-preserving hashing: (a) three points in the input space, (b) hashing where the similarities are not preserved in the two-dimensional Hamming space, and (c) hashing where the similarities are well preserved

2.1.3. Loss function
The loss function is intended to preserve the similarity order, i.e., to minimize the difference between the nearest neighbor search result in the hash coding space and the search result in the input space. The loss function $Loss(X, W)$ is defined as follows:

$Loss(X, W) = \operatorname{argmin} \sum_{x_i, x_j \in X}^{N} \| d_{ij} - d_{ij}^{H} \|^2$ (2)

where $X$ is the input data and $W$ is the projection matrix. Specifically, $y_i = h(x_i)$ needs to be binary. This binary constraint leads to a difficult optimization problem. To solve it, we drop the binary constraint and let the codes be continuous; the codes are then binarized by thresholding. With the binary constraint relaxed, various standard optimization techniques can be applied.
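The relax-then-threshold scheme just described can be sketched as follows. This is a minimal illustration of (2) under the relaxation, where pairwise distances between continuous codes stand in for Hamming distances during optimization; it is not the specific optimizer used later in this paper.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all columns of A."""
    sq = np.sum(A**2, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)

def relaxed_loss(X, Y):
    """Relaxed form of eq. (2): compare pairwise input distances with
    pairwise distances between the continuous (not yet binary) codes Y."""
    return np.sum((pairwise_sq_dists(X) - pairwise_sq_dists(Y)) ** 2)

def binarize(Y):
    """After optimization, threshold the relaxed codes to {-1, 1}."""
    return np.where(Y >= 0.0, 1.0, -1.0)
```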
2.1.4. Deep learning to hash
The goal of learning to hash is to learn specific hash functions that map a high-dimensional input vector to a compact binary vector yielding good retrieval quality and search speed [20]. For unlabeled data, an unsupervised deep learning-to-hash model that maps the input vector $x \in R^D$ to compact binary codes is illustrated in Figure 2.

Figure 2. Unsupervised deep learning model, with weights $w^{(1)}, \ldots, w^{(L)}$ connecting layer 1 (the input layer, $x \in R^D$) through layers 2 to $L$ to layer $L+1$ (the reconstruction layer, $\hat{x} \in R^D$)

Assume that an unsupervised deep network consists of $L+1$ layers. A binary vector $y_i$ is generated by passing the input vector $x_i$ through the network, which contains multiple hierarchical layers of nonlinear functions. The binary code of $x_i$ at the $L$th layer can be calculated as follows:

$y_i = h(x_i) = \mathrm{sgn}(F(x_i, W))$ (3)

where $F(x_i, W)$ is a composition of nonlinear transformations defined as follows:

$F(x_i, W) = f_L(\cdots f_2(f_1(x_i, w^{(1)}), w^{(2)}) \cdots, w^{(L)})$ (4)

where each $f_l(\cdot)$ takes the output of the previous layer and the weight vector $w^{(l)}$ as input and produces the projection passed to the next layer. The learning algorithm aims to learn a set of nonlinear weight vectors $W = \{w^{(1)}, \ldots, w^{(L)}\}$ such that the information from the input space is preserved.

2.2. Search with hashing
There are two strategies for performing a search with hashing: hash code ranking and hash table lookup [19]. For hash code ranking, an exhaustive search is performed by comparing the distance (e.g., the Hamming distance) between the query and the reference items. The items with the smallest distances, called nearest neighbors, are retrieved. However, the cost of computing the distances degrades performance. The alternative approach, hash table lookup, aims to accelerate the search by reducing the number of distance computations. The inverse lookup database, called a hash table, is composed of buckets indexed by the hash codes. Given a query, the matching items stored in the corresponding bucket are retrieved.
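A hash table of this kind reduces to an inverted index from codes to buckets. Below is a minimal sketch of the lookup strategy; the per-item payload (a song identifier) is an illustrative assumption.

```python
from collections import defaultdict

def build_hash_table(codes, items):
    """Inverse lookup database: buckets indexed by hash code, each
    holding the items that hashed to that code."""
    table = defaultdict(list)
    for code, item in zip(codes, items):
        table[code].append(item)
    return table

# usage: a lookup retrieves a whole bucket with no distance computations
table = build_hash_table([885924, 885924, 12345], ["song_1", "song_7", "song_2"])
print(table[885924])   # ['song_1', 'song_7']
```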
2.3. Audio fingerprinting
Audio fingerprinting is best known for its ability to identify an unknown audio recording using a compact content-based signature, the so-called fingerprint [21]. It does this by converting the audio features into hash codes, aiming to uniquely identify an audio recording. A fingerprint is relatively small, so it reduces storage costs. Moreover, perceptual irrelevancies are removed from the fingerprint, resulting in efficient comparison and searching.

3. METHOD
The aim of this paper is to provide an efficient technique that yields good retrieval quality and computational efficiency. In this work, compact binary codes for fingerprint indexing are learned with an unsupervised deep network in such a way that the similarity between samples is preserved. Once a short audio sample is submitted to our content-based audio retrieval system, the system performs a database lookup for the matching track and returns the song ID from which the query was taken. As shown in Figure 3, the system is designed with three steps: (1) fingerprint feature extraction, (2) unsupervised similarity-preserving hashing, and (3) sequence matching.

Figure 3. The construction of the proposed method for our content-based audio retrieval system: fingerprint features $[f_1, \Delta f, \Delta t]$ are extracted from the reference and query audio, mapped to 16-bit hashes by the learned deep network, stored in the items database, and compared by sequence matching to return the song ID
3.1. Fingerprint feature extraction
Before fingerprint feature extraction is performed, the audio signal is converted into a common format for analysis. Next, the time-series audio signal is converted into the time-frequency domain, from which more meaningful information can be extracted. Each aspect is detailed below.

3.1.1. Preprocessing and transform
In this paper, the fingerprint extraction presented in [3] is applied. First, we convert the input audio to a mono signal and downsample it from the standard digital audio rate of 44.1 kHz to 8 kHz, which makes the data easier to handle, reduces the database size, and increases the speed of the algorithm. The audio signal is then converted into a time-frequency representation. We perform a short-time Fourier transform (STFT) with a window size of 64 ms for good spectral resolution [22] and a hop size of 32 ms. Figure 4 shows the resulting time-frequency graph, the so-called spectrogram: time on the horizontal axis, frequency on the vertical axis, and intensity as the third dimension. Each point on the graph represents the intensity of a given frequency at a specific time.

Figure 4. Spectrogram with peak intensities

3.1.2. Feature extraction
After converting the signal into the time-frequency domain, features are extracted from the spectrum. Due to their robustness to noise and distortion, the amplitude peaks in each frame are selected as candidate points. Each candidate point is paired with adjacent peaks. The constellation map of paired points with the coordinate list is shown in Figure 5. In this work, each candidate point is paired within a target region of 31 frequency bins and 63 time frames, and only the 3 peaks closest in time are selected. Figure 6 shows the combinatorial association of a pair of two points, which is called a 'landmark'. Each pair consists of four components: the starting frequency $f_1$, the starting time $t_1$, the end frequency $f_2$, and the end time $t_2$.

Figure 5. A constellation map of paired points

Figure 6. The combinatorial association of a pair of two points: a candidate point $(t_1, f_1)$ is paired with $(t_2, f_2)$ inside a target region of 31 frequency bins and 63 time frames, giving $\Delta f = f_2 - f_1$, $\Delta t = t_2 - t_1$, and the feature $[f_1, \Delta f, \Delta t]$
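A compact sketch of this front end is given below, assuming SciPy's STFT and a simple 3x3 local-maximum peak picker; the paper does not specify its exact peak-picking rule, so that part is an illustrative assumption, while the window, hop, and target-region limits follow the text.

```python
import numpy as np
from scipy.signal import stft

def landmarks(audio, fs=8000):
    """Sketch of the fingerprint front end: 64 ms STFT window, 32 ms hop,
    local-maximum peak picking, then pairing each candidate point with up
    to 3 later peaks inside a 31-bin x 63-frame target region."""
    nper = int(0.064 * fs)                            # 512 samples = 64 ms
    _, _, Z = stft(audio, fs, nperseg=nper, noverlap=nper // 2)
    S = np.abs(Z)
    # candidate points: bins that dominate their 3x3 neighborhood
    peaks = [(t, f)
             for t in range(1, S.shape[1] - 1)
             for f in range(1, S.shape[0] - 1)
             if S[f, t] > 0 and S[f, t] == S[f-1:f+2, t-1:t+2].max()]
    feats = []
    for (t1, f1) in peaks:                            # quadratic pairing, kept simple
        zone = [(t2, f2) for (t2, f2) in peaks
                if 0 < t2 - t1 <= 63 and abs(f2 - f1) <= 31]
        for (t2, f2) in sorted(zone)[:3]:             # 3 closest in time
            feats.append((f1, f2 - f1, t2 - t1, t1))  # [f1, df, dt] plus offset t1
    return feats

feats = landmarks(np.random.randn(8000))              # 1 s of noise at 8 kHz
```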
3.1.3. Audio fingerprint
For the landmark mentioned above, the audio fingerprint can be defined as follows:

$Fingerprint = [f_1, \Delta f, \Delta t]$ (5)

where the frequency difference is $\Delta f = f_2 - f_1$ and the time difference between the two points is $\Delta t = t_2 - t_1$. The fingerprint is also associated with the offset time from the beginning of the audio file to the starting time $t_1$. This fingerprint feature $[f_1, \Delta f, \Delta t]$ is used to generate the hash code in [3]. The hash model can be defined as shown in (6).

$f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$ (6)

where the fingerprint hash is composed of the 8-bit frequency $f_1$, the 6-bit frequency difference $\Delta f$, and the 6-bit time difference $\Delta t$. Figure 7 shows an example of a 20-bit hash address calculated from (6).

Figure 7. An example of a 20-bit hash address: the feature $[f_1, \Delta f, \Delta t] = [216, 18, 36]$ is mapped by the hash function $f_1 \times 2^{12} + \Delta f \times 2^{6} + \Delta t$ to the hash address 885924 (11011000010010100100 in binary)

For a 16-bit fingerprint hash, composed of a 6-bit frequency $f_1$, a 5-bit frequency difference $\Delta f$, and a 5-bit time difference $\Delta t$, the hash model can be defined as shown in (7).

$f_1 \times 2^{10} + \Delta f \times 2^{5} + \Delta t$ (7)

After the hash code is calculated, the system uses this code as an index for searching the database. An exact matching algorithm is applied in [3]. Unlike the Shazam algorithm, we develop a deep neural network with multiple hierarchical layers of nonlinear and linear transformations to learn compact codes from these fingerprint features such that the similarity between samples is preserved. The details are described in the next section.
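The packings in (6) and (7) amount to shifting the three fields into adjacent bit ranges; the following sketch reproduces the Figure 7 example.

```python
def hash20(f1, df, dt):
    """20-bit landmark hash of eq. (6): 8-bit f1 | 6-bit df | 6-bit dt."""
    return f1 * 2**12 + df * 2**6 + dt

def hash16(f1, df, dt):
    """16-bit landmark hash of eq. (7): 6-bit f1 | 5-bit df | 5-bit dt."""
    return f1 * 2**10 + df * 2**5 + dt

print(hash20(216, 18, 36))                  # 885924, the example in Figure 7
print(format(hash20(216, 18, 36), '020b'))  # 11011000010010100100
```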
3.2. Unsupervised similarity-preserving hashing (USH)
In this paper, the hash transformations are created by an unsupervised deep neural network. As shown in Figure 8, there are 5 layers in our deep network: the input layer consists of 20 nodes for the input $x_i$, the three hidden layers consist of 19, 18, and 16 nodes, respectively, and the output layer consists of 20 nodes for the reconstruction $\hat{x}_i$.

Figure 8. Our proposed unsupervised similarity-preserving hashing network (USH), with 20-19-18-16-20 nodes from the input layer (layer 1) to the reconstruction layer (layer 5)

Our deep network is learned so that the output of the fourth layer can be used as the binary hash codes. In the network design, each node is composed of one input summation function and one output transformation function. The function $f(\cdot)$ combines the information arriving over the links from other nodes, as shown in (8).

$node_{in} = f(x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}; w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n})$ (8)

where $x_{(l),1}, x_{(l),2}, \ldots, x_{(l),n}$ are the inputs to the node, $w_{(l),1}, w_{(l),2}, \ldots, w_{(l),n}$ are the associated weights, $l$ indicates the layer number, and $n$ is the number of input nodes. The output activation function $a(f(\cdot))$ is shown in (9).

$node_{out} = a_{(l)}(node_{in}) = a_{(l)} f(\cdot)$ (9)

Let $H_i = \sum w_i x_i$ be the sum of products of the inputs $x_i$ and weights $w_i$. The functions of the nodes from layer 1 to layer 5 of our proposed network are defined as follows.

Layer 1: In this layer, the nodes only convey the inputs to the nodes of the next layer. The functions of the $i$th node are

$f_{(1),i} = x_{(1),i}$ and $a_{(1),i} = f_{(1),i}$ (10)

Layer 2: In this layer, the sigmoid function is used as the activation function. Let $x_{(2),i} = a_{(1),i}$ and $H_{(2),i} = \sum w_{(2),i} x_{(2),i}$. The functions of the $i$th node are

$f_{(2),i} = \frac{1}{1 + e^{-H_{(2),i}}}$ and $a_{(2),i} = f_{(2),i}$ (11)

Layer 3: In this layer, the sigmoid function is again applied as the activation function. Let $x_{(3),i} = a_{(2),i}$ and $H_{(3),i} = \sum w_{(3),i} x_{(3),i}$. The functions of the $i$th node are

$f_{(3),i} = \frac{1}{1 + e^{-H_{(3),i}}}$ and $a_{(3),i} = f_{(3),i}$ (12)

Layer 4: The output of each node in this layer will be used as the binary codes. During training, these codes are used to reconstruct the input data at the output layer. The hyperbolic tangent function is used as the activation function in this layer. Let $x_{(4),i} = a_{(3),i}$ and $H_{(4),i} = \sum w_{(4),i} x_{(4),i}$. The functions of the $i$th node are

$f_{(4),i} = \frac{e^{H_{(4),i}} - e^{-H_{(4),i}}}{e^{H_{(4),i}} + e^{-H_{(4),i}}}$ and $a_{(4),i} = f_{(4),i}$ (13)

Layer 5: This is the output or reconstruction layer. To preserve the similarity between samples, the target outputs are set equal to the inputs of layer 1. Let $x_{(5),i} = a_{(4),i}$ and $H_{(5),i} = \sum w_{(5),i} x_{(5),i}$. The functions of the $i$th output node are

$f_{(5),i} = H_{(5),i}$ and $a_{(5),i} = f_{(5),i}$ (14)
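The layer definitions (10)-(14) correspond to the following NumPy forward pass; the weight initialization is an illustrative assumption, and the training objective appears next in (15).

```python
import numpy as np

def sigmoid(H):
    return 1.0 / (1.0 + np.exp(-H))

def ush_forward(X, W, b):
    """Forward pass of the 20-19-18-16-20 USH network: sigmoid hidden
    layers (11)-(12), tanh code layer (13), linear reconstruction (14)."""
    H2 = sigmoid(W[0] @ X + b[0])      # layer 2, 19 nodes
    H3 = sigmoid(W[1] @ H2 + b[1])     # layer 3, 18 nodes
    H4 = np.tanh(W[2] @ H3 + b[2])     # layer 4, 16 nodes -> hash codes
    X_hat = W[3] @ H4 + b[3]           # layer 5, 20 nodes, reconstructs X
    return H4, X_hat

sizes = [20, 19, 18, 16, 20]
rng = np.random.default_rng(0)
W = [0.1 * rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((o, 1)) for o in sizes[1:]]

X = rng.standard_normal((20, 5))       # five 20-dimensional fingerprints
H4, X_hat = ush_forward(X, W, b)
Y = np.sign(H4)                        # 16-bit binary codes in {-1, 1}
```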
To achieve efficient binary codes, we include constraints in the objective function so that the codes have four properties: (1) belonging to $\{1, -1\}$, (2) similarity preserving, (3) independence, and (4) balance. In this paper, the method presented in UH-BDNN [23] is applied to optimize the objective function, which is defined as follows:

$\min_{W,b} Loss = \frac{1}{2N} \left\| X - \left(W^{(L-1)} Y + b^{(L-1)} \times [1]_{1 \times N}\right) \right\|^2 + \frac{\lambda_1}{2} \sum_{l=1}^{L-1} \left\| W^{(l)} \right\|^2 + \frac{\lambda_2}{2N} \left\| H^{(L-1)} - Y \right\|^2 + \frac{\lambda_3}{2} \left\| \frac{1}{N} H^{(L-1)} H^{(L-1)T} - I \right\|^2 + \frac{\lambda_4}{2N} \left\| H^{(L-1)} [1]_{N \times 1} \right\|^2$ (15)

$s.t.\ Y \in \{1, -1\}^{K \times N}$ (16)

where $X \in R^{D \times N}$ is a set of $N$ training samples of dimension $D$, $Y \in \{1, -1\}^{K \times N}$ is the output binary code of $X$, $K$ is the number of bits, $L$ is the number of layers, $W^{(l)}$ is the weight matrix between layer $l+1$ and layer $l$, $b^{(l)}$ is the bias vector for the nodes in layer $l+1$, $H^{(l)} = f^{(l)}(W^{(l-1)} H^{(l-1)} + b^{(l-1)} [1]_{1 \times N})$ is the output of layer $l$ with $H^{(1)} = X$, $f^{(l)}$ is the activation function of layer $l$, and $\lambda_1$-$\lambda_4$ are the parameters for optimizing the objective function. The first term of (15) ensures that the binary codes allow a good reconstruction of $X$. The second term is a weight regularizer that encourages the network to keep the weights small in order to reduce overfitting. The third term measures the violation of the equality constraint. The fourth term enforces independence and the fifth term balance of the binary codes. The constraint (16) ensures that each bit of the binary codes belongs to $\{1, -1\}$.

After the efficient codes are produced by the deep network, they are used as the search index in our content-based audio retrieval system. The song ID, $t_1$, $f_1$, $\Delta f$, and $\Delta t$ are stored at their hash address in the database. Table 1 shows the representation of the information data.

Table 1. Representation of information data
  Method      | Index                        | Information data
  USH         | 16-bit hash address          | Song ID, $t_1$, $f_1$, $\Delta f$, $\Delta t$
  Shazam [3]  | 16-bit / 20-bit hash address | Song ID, $t_1$

3.3. Sequence matching
In the query step, a sequence of query features is converted to a set of compact hash codes and used for searching the inverse lookup database. Let $Q = \{q_1, q_2, \ldots, q_M\}$ represent the sequence of query features, where $q_m$ is the query at order $m$, $m = 1, 2, \ldots, M$, and $M$ is the total number of queries in the sequence. The learned hash function $H: q_m \rightarrow h(q_m)$ is used to map the query features to binary hash codes. We can map $Q = \{q_m\}_{m=1}^{M}$ to the corresponding binary codes as follows:

$Y = H(Q) = \{h(q_1), h(q_2), \ldots, h(q_M)\}$ (17)

where $Y \in \{1, -1\}^{K \times M}$ is the set of hash codes of $Q$ and $K$ is the number of bits. After learning the deep network, we obtain a set of items indexed by each hash address. Let $s_m = \{x_{m1}, x_{m2}, \ldots, x_{mn_m}\}$ be the set of items retrieved for $q_m$, where $x_{mi} \in R^{5 \times 1}$ is an information vector stored in the database and $n_m$ is the number of items for $q_m$. Given the set $S = \{s_m\}_{m=1}^{M}$, the sequence matching process is shown in Figure 9.

Figure 9. Our proposed sequence matching: the query sequence $Q = \{q_1, q_2, q_3\}$ is mapped by the learned hash function to addresses in the reference inverse lookup database, the retrieved buckets are collected into $S$, and the song ID is returned from the most frequently occurring similar relative offset time among the minimum-distance candidates
As can be seen in Figure 9, assume that $Q = \{q_1, q_2, q_3\}$; given the binary codes $H(Q)$, we obtain the sets $s_1 = \{x_{11}, x_{12}, x_{13}, x_{14}\}$, $s_2 = \{x_{21}, x_{22}\}$, $s_3 = \{x_{31}, x_{32}, x_{33}\}$, and $S = \{x_{11}, x_{12}, x_{13}, x_{14}, x_{21}, x_{22}, x_{31}, x_{32}, x_{33}\}$. The one-nearest-neighbor search $NN(q_m)$ for the query item at order $m$ within $s_m$ is defined as follows:

$NN(q_m) = \operatorname{argmin}_{x_{mi} \in s_m} \| x_{mi} - q_m \|_2$ (18)

where $\| x_{mi} - q_m \|_2$ is the $L_2$-norm between $x_{mi}$ and the query $q_m$. Let $R_Q = \{NN(q_m)\}_{m=1}^{M}$ be the candidate item set of $Q$. We also apply a time offset constraint to improve the accuracy of the sequence matching. The time offset constraint $|T_{x_m} - T_{q_m}|$ is the absolute difference between $T_{x_m}$ and $T_{q_m}$, defined as follows:

$|T_{x_1} - T_{q_1}| = \cdots = |T_{x_M} - T_{q_M}|$ (19)

where $T_{x_m}$ and $T_{q_m}$ are the offset times of the reference file $x_m$ and the query $q_m$, respectively. The offset time constraint expresses that a sequence of candidate items from the same recording should occur with the same absolute time difference across the query sequence. In summary, the proposed audio retrieval algorithm is based on two parts: the similarity (minimum distance) between the audio query and the songs in the reference database, and the absolute difference among the time sequences. The procedure is as follows.

Algorithm: Similarity-preserving hash for content-based audio retrieval using unsupervised deep neural networks
Input: reference set $M = \{m_i\}_{i=1}^{N}$; query set $Q$
Output: SID, the song ID returned by the proposed algorithm
Step 1: Extract fingerprint features $X = \{x_i\}_{i=1}^{N} \in R^{D \times N}$ from the reference dataset.
Step 2: Learn the hash, where $x_i \in R^D$ is an input vector and the objective function is defined by (15)-(16). The learned hash function $h(q)$ is applied to each audio fingerprint $q$ in step 3.
Step 3: Sequence matching (see the sketch after this algorithm)
  1. Divide the query sample into $M$ fingerprints $Q = \{q_m\}_{m=1}^{M}$.
  2. $s_i = \{\}$, $A = \{\}$, $S = \{\}$, $R_Q = \{\}$
     for $i = 1, 2, \ldots, M$ do
       $index_i = h(q_i)$
       $s_i$ = all items collected from bucket $A(index_i)$
       $S = S \cup s_i$
     end
     for $i = 1, 2, \ldots, M$ do
       Aux = maxValue
       for $j = 1, 2, \ldots$, sizeOf($s_i$) do
         if $\|x_j - q_i\|_2$ < Aux then
           Aux = $\|x_j - q_i\|_2$
           absTime = $|T_{x_j} - T_{q_i}|$
           SID = $SID_{x_j}$
         end
       end
       $R_Q = R_Q \cup \{SID, absTime\}$
     end
  3. Return the song ID with the maximum frequency in $R_Q$ among entries having the same absolute difference among the time sequences.
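Step 3 condenses into a few lines of Python. This is a minimal sketch assuming a lookup table mapping hash codes to (song ID, offset time, feature vector) entries; the names are chosen for illustration.

```python
from collections import Counter
import numpy as np

def sequence_match(queries, h, table):
    """For each query fingerprint, pick the nearest bucket item by L2
    distance (eq. (18)), then vote per (song ID, |T_x - T_q|) (eq. (19))."""
    votes = Counter()
    for q, t_q in queries:                      # (feature vector, offset time)
        bucket = table.get(h(q), [])            # items stored at this hash
        if not bucket:
            continue
        sid, t_x, _ = min(bucket, key=lambda it: np.linalg.norm(it[2] - q))
        votes[(sid, abs(t_x - t_q))] += 1
    if not votes:
        return None
    (sid, _), _ = votes.most_common(1)[0]       # most frequent consistent hit
    return sid
```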
4. EXPERIMENTAL AND PERFORMANCE ANALYSIS
4.1. Database
The performance of the proposed USH method is evaluated on the Extended Ballroom dataset, freely available from [24, 25]. The dataset consists of 4,180 musical excerpts from 13 genres with a length of 30 seconds each. The audio quality of this data is 44.1 kHz, 192 kbps, stereo, MP3 format. In this work, the audio signal is downsampled to 8 kHz to make the data easier to handle, as previously mentioned. The training set (also used as the reference database for retrieval) is composed of 3,000 tracks from 8 genres, the same rhythm classes as in our previous works [26, 27]. A set of 1,000 audio queries with a length of 10 seconds each is randomly selected from those 3,000 tracks. Another set of 200 audio queries comes from audio files that do not appear in the database, in order to analyze the false positive rate. Each audio sample is represented by a 20-dimensional feature vector extracted by the fingerprint algorithm. Table 2 shows the number of samples in the database and the query set.

Table 2. Audio samples in the database and query set
  Set      | Number of tracks | Length of segment (s) | Number of samples
  Database | 3,000            | 30                    | 441,184
  Query    | 1,200            | 10                    | 67,785

4.2. Performance evaluation
4.2.1. Effectiveness of retrieval
On a total of 1,200 audio queries, the retrieval results obtained from the proposed USH method and the state-of-the-art data-independent method, the Shazam algorithm, are shown in Table 3. A false negative (FN) refers to the incorrect identification that the query audio does not exist in the database when it does, a true positive (TP) refers to the correct identification of the audio recording from the query, a false positive (FP) refers to the incorrect identification of a wrong recording when the correct recording does not exist in the database, and a true negative (TN) refers to the correct identification that no audio recording matches the query. According to the experimental results, the proposed USH obtains a higher accuracy (88.92%) than the state-of-the-art data-independent method, the Shazam algorithm (71.67% for the 16-bit hash code, 87.42% for the 20-bit hash code). Figure 10 shows the retrieval accuracy comparison between the two methods.

Table 3. Retrieval results comparison between USH and the state-of-the-art data-independent method
  Method        | FN  | TP  | FP | TN
  USH 16-bit    | 114 | 886 | 19 | 181
  Shazam 16-bit | 290 | 710 | 50 | 150
  Shazam 20-bit | 132 | 868 | 19 | 181

Figure 10. The retrieval accuracy of the proposed USH and the Shazam algorithm

The effectiveness of the USH is evaluated through the experiments and compared with the state-of-the-art data-independent method in terms of precision, recall, F1 score, and false positive rate [28], as follows:

$Precision = \frac{TP}{TP + FP} \times 100$ (20)

$Recall = \frac{TP}{TP + FN} \times 100$ (21)

$F1\ score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$ (22)

$False\ positive\ rate = \frac{FP}{FP + TN} \times 100$ (23)
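The following sketch computes (20)-(23) from a row of Table 3 and reproduces the corresponding row of Table 4.

```python
def metrics(tp, fn, fp, tn):
    """Precision, recall, F1 score, and false positive rate, eqs. (20)-(23)."""
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = 100 * fp / (fp + tn)
    return precision, recall, f1, fpr

# USH 16-bit row of Table 3 reproduces Table 4: (97.9, 88.6, 93.02, 9.5)
print(tuple(round(v, 2) for v in metrics(tp=886, fn=114, fp=19, tn=181)))
```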
As can be seen in Table 4, the proposed USH obtains higher precision and recall values than the state-of-the-art data-independent method with both 16-bit and 20-bit hash codes. The F1 score (fourth column) shows the overall effectiveness of the two methods. Furthermore, the USH has a significantly lower percentage of false positives (9.50%) than the state-of-the-art data-independent method (25.00%) for the same 16-bit hash code. This shows the superior performance of the USH on short codes.

Table 4. Effectiveness comparison between USH and the state-of-the-art data-independent method
  Method        | Precision | Recall | F1 score | % False positive
  USH 16-bit    | 97.90     | 88.60  | 93.02    | 9.50
  Shazam 16-bit | 93.42     | 71.00  | 80.68    | 25.00
  Shazam 20-bit | 97.86     | 86.80  | 92.00    | 9.50

4.2.2. Storage cost
With the proposed USH method, the 20-bit fingerprint features are mapped to 16-bit binary codes. This reduces the hash address space, and hence the index size, by 16-fold (from 2^20 to 2^16 buckets), resulting in higher search performance.

4.3. Discussion
The performance of the proposed USH for large audio database retrieval is evaluated through the experiments and compared with the state-of-the-art data-independent hashing method, the Shazam algorithm, on a test set of 3,000 audio recordings. The experimental results support the effectiveness of the USH, with high precision and recall values of 97.90% and 88.60%, respectively. These satisfactory results follow from the fact that the hash codes produced by the proposed method have the similarity-preserving property, i.e., similar items are mapped to the same hash code while dissimilar items are mapped to different ones. Data-independent methods do not take this property into account. The Shazam algorithm has a higher percentage of false positives (25.00%) than the USH (9.50%) for the same 16-bit hash code. This indicates that the Shazam algorithm is more likely to give incorrect identifications for short codes and therefore performs worse in audio retrieval. Furthermore, if the database grows substantially in the future, the Shazam algorithm would most likely produce a significant number of false positive matches.

Consider the collection S of the USH and of the Shazam algorithm, which affects the accuracy of audio retrieval. For the Shazam algorithm, the collection S consists of the data items for which the search algorithm finds only exact matches for the search queries, regardless of similarity preservation, and this may lose a number of relevant data. The collection S of the proposed USH method, in contrast, consists of candidate data items for which the search algorithm focuses on the similarity between the search queries and the items in the database. As shown in Table 5, with the collection S of the USH, Song ID 48 is correctly identified by the two data items with the smallest distance (distance = 1) at the same time offset. With the Shazam algorithm, the relevant song cannot be retrieved.

Table 5. Example of the collection S of the proposed USH and the Shazam algorithm
  Method        | Song ID | Time offset | Distance | Number of item(s)
  USH 16-bit    | 150     | 128         | 16       | 2
                | 1041    | 239         | 28       | 2
                | 936     | -232        | 32       | 2
                | 48*     | 152         | 1        | 2
                | 306     | 228         | 48       | 2
                | 2453    | 516         | 24       | 2
                | 690     | 3720        | 32       | 1
                | 690     | 2726        | 32       | 1
  Shazam 16-bit | 2732    | 164         | -        | 2
                | 847     | 19          | -        | 2
                | 684     | 2871        | -        | 1
                | 674     | 4475        | -        | 1
                | 675     | 2584        | -        | 1
                | 676     | 1564        | -        | 1
                | 676     | 4739        | -        | 1
                | 677     | 1820        | -        | 1
  Shazam 20-bit | 1174    | 52          | -        | 1
                | 1176    | 557         | -        | 1
                | 1230    | 218         | -        | 1
                | 706     | 363         | -        | 1
                | 340     | 395         | -        | 1
                | 48      | 153         | -        | 1
                | 93      | 448         | -        | 1
                | 139     | 453         | -        | 1
  * The system correctly identifies the audio recording
The major factors behind this superiority are 1) the similarity-preserving hash codes produced by the proposed USH method, and 2) the proposed audio retrieval algorithm, which combines two metrics: the $L_2$-norm and the absolute offset time difference. These factors increase the ability to identify candidate items according to the similarity between the audio sample and the songs in the reference database, which significantly improves the retrieval performance.

5. CONCLUSION
In this paper, an unsupervised similarity-preserving hashing (USH) method for content-based audio retrieval is proposed. We develop a deep network with multiple hierarchical layers of nonlinear and linear transformations to learn compact hash codes where the similarity between samples is preserved. The independence and balance properties are included and optimized in the objective function to improve the codes. The experimental results on the Extended Ballroom dataset show the superiority of the proposed method over the state-of-the-art data-independent method. Future work should focus on extending USH to supervised hashing by leveraging semantic labels to further enhance the retrieval performance.

REFERENCES
[1] P. Grosche, M. Müller, and J. Serrà, "Audio Content-Based Music Retrieval," Multimodal Music Processing, Dagstuhl Follow-Ups, vol. 3, pp. 157-174, 2012.
[2] J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system with an efficient search strategy," Journal of New Music Research, vol. 32, no. 2, pp. 211-222, 2003.
[3] A. L. Wang, "An Industrial-Strength Audio Search Algorithm," 4th International Conference on Music Information Retrieval (ISMIR 2003), pp. 7-13, 2003.
[4] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, "Time-Frequency Ratio Hashing for Content-Based Audio Retrieval," 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, pp. 205-210, 2017.
[5] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," 25th International Conference on Very Large Data Bases (VLDB'99), pp. 518-529, 1999.
[6] B. Kulis, P. Jain, and K. Grauman, "Fast Similarity Search for Learned Metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143-2157, 2009.
[7] M. Raginsky and S. Lazebnik, "Locality-Sensitive Binary Codes from Shift-Invariant Kernels," 23rd Annual Conference on Neural Information Processing Systems (NIPS'09), pp. 1509-1517, 2009.
[8] Y. Zheng, J. Zhu, W. Fang, and L.-H. Chi, "Deep Learning Hash for Wireless Multimedia Image Content Security," Security and Communication Networks, vol. 2018, pp. 1-13, 2018.
[9] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, "Learning to Hash for Indexing Big Data-A Survey," Proceedings of the IEEE, vol. 104, no. 1, pp. 34-57, 2016.
[10] Y. Weiss, A. Torralba, and R. Fergus, "Spectral Hashing," 21st International Conference on Neural Information Processing Systems (NIPS'08), pp. 1753-1760, 2008.
[11] B. Kulis and K. Grauman, "Kernelized Locality-Sensitive Hashing for Scalable Image Search," 2009 IEEE 12th International Conference on Computer Vision (ICCV), Kyoto, pp. 2130-2137, 2009.
[12] Y. Gong and S. Lazebnik, "Iterative Quantization: A Procrustean Approach to Learning Binary Codes," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 817-824, 2011.
[13] B. Kulis and T. Darrell, "Learning to Hash with Binary Reconstructive Embeddings," 22nd International Conference on Neural Information Processing Systems (NIPS'09), pp. 1042-1050, 2009.
[14] W. Liu, J. Wang, R. Ji, Y. G. Jiang, and S. F. Chang, "Supervised Hashing with Kernels," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, pp. 2074-2081, 2012.
[15] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, "Hamming Distance Metric Learning," 25th International Conference on Neural Information Processing Systems (NIPS'12), pp. 1061-1069, 2012.
[16] J. Wang, S. Kumar, and S. F. Chang, "Semi-Supervised Hashing for Large-Scale Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393-2406, 2012.
[17] F. Shen, C. Shen, W. Liu, and H. T. Shen, "Supervised Discrete Hashing," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 37-45, 2015.
[18] X. Bai, H. Yang, J. Zhou, P. Ren, and J. Cheng, "Data-dependent Hashing Based on p-Stable Distribution," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5033-5046, 2014.
[19] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, "A Survey on Learning to Hash," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 769-790, 2018.
[20] J. He, S. F. Chang, R. Radhakrishnan, and C. Bauer, "Compact hashing with joint optimization of search accuracy and time," IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, pp. 753-760, 2011.
[21] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, "A Review of Audio Fingerprinting," Journal of VLSI Signal Processing, vol. 41, pp. 271-284, 2005.
[22] D. Ellis, "Robust Landmark-Based Audio Fingerprinting," 2015. [Online]. Available: https://labrosa.ee.columbia.edu/matlab/fingerprint/.
[23] T. T. Do, A. D. Doan, and N. M. Cheung, "Learning to Hash with Binary Deep Neural Network," 14th European Conference on Computer Vision (ECCV 2016), pp. 219-234, 2016.
[24] U. Marchand and G. Peeters, "Scale and shift invariant time/frequency representation using auditory statistics: Application to rhythm description," 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Vietri sul Mare, pp. 1-6, 2016.
[25] U. Marchand and G. Peeters, "The Extended Ballroom Dataset," 17th International Society for Music Information Retrieval Conference (ISMIR 2016) Late-Breaking Session, New York, USA, pp. 1-3, 2016.
[26] P. Panyapanuwat, S. Kamonsantiroj, and L. Pipanmaekaporn, "Unsupervised Learning Hash for Content-Based Audio Retrieval Using Deep Neural Networks," 2019 11th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, pp. 99-104, 2019.
[27] P. Panyapanuwat and S. Kamonsantiroj, "Performance Comparison of Unsupervised Deep Hashing with Data-independent Hashing for Content-Based Audio Retrieval," 2019 2nd International Conference on Electronics, Communications and Control Engineering, pp. 16-20, 2019.
[28] C. Manning, P. Raghavan, and H. Schütze, "An Introduction to Information Retrieval," Cambridge University Press, 2009.

BIOGRAPHIES OF AUTHORS
Petcharat Panyapanuwat holds a bachelor's degree in mathematics and a master's degree in software engineering. She is currently a Ph.D. candidate at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. Her current research interest focuses on music information retrieval.

Suwatchai Kamonsantiroj is currently a lecturer at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. He holds a bachelor's degree in mechanical engineering and a master's degree in information technology management. He earned his doctoral degree in computer engineering from Kasetsart University, Thailand, graduating in 2008. His current research interests include neural networks, time series analysis, and artificial intelligence.

Luepol Pipanmaekaporn is currently a lecturer at the Department of Computer and Information Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. He holds both a bachelor's and a master's degree in computer science. He earned his doctoral degree in computer science from Queensland University of Technology, Australia, graduating in 2013. His current research interests include information retrieval, web mining, and data mining.