An Efficient Privacy-Preserving Ranked
Keyword Search Method
Chi Chen, Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student Member, IEEE,
Jiankun Hu, Member, IEEE, Song Guo, Senior Member, IEEE, Zahir Tari, Senior Member, IEEE, and
Albert Y. Zomaya, Fellow, IEEE
Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation. It is therefore essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents is normally concealed in the process of encryption, which leads to a significant degradation of search accuracy. In addition, the volume of data in data centers has experienced dramatic growth, which makes it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on a minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is reached. In the search phase, this approach achieves a linear computational complexity against an exponential increase in the size of the document collection. In order to verify the authenticity of search results, a structure called the minimum hash sub-tree is designed in this paper. Experiments have been conducted using a collection built from IEEE Xplore. The results show that, with a sharp increase in the number of documents in the dataset, the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in terms of rank privacy and the relevance of retrieved documents.
Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, security
1 INTRODUCTION
As we step into the big data era, terabytes of data are produced worldwide every day. Enterprises and users who own a large amount of data usually choose to outsource their precious data to cloud facilities in order to reduce data management costs and storage spending. As a result, the data volume in cloud storage facilities is experiencing a dramatic increase. Although cloud server providers (CSPs) claim that their cloud services are armed with strong security measures, security and privacy are major obstacles preventing the wider acceptance of cloud computing services [1].
A traditional way to reduce information leakage is data encryption. However, this makes server-side data utilization, such as searching over encrypted data, a very challenging task. In recent years, researchers have proposed many ciphertext search schemes [34], [35], [36], [37], [43] that incorporate cryptographic techniques. These methods come with provable security, but they require massive numbers of operations and have high time complexity. They are therefore not suitable for the big data scenario, where the data volume is huge and applications require online data processing. In addition, the relationship between documents is concealed in the above methods. The relationship between documents represents their properties, and hence maintaining this relationship is vital to fully express a document. For example, the relationship can be used to express the category of a document. If a document is related to no other documents except those about sports, then it is easy to assert that this document belongs to the sports category. Due to blind encryption, this important property is concealed by the traditional methods. Therefore, a method that can maintain and utilize this relationship to speed up the search phase is desirable.
On the other hand, due to software/hardware failure and storage corruption, the search results returned to users may contain damaged data or may have been distorted by a malicious administrator or intruder. Thus, a verifiable mechanism should be provided for users to verify the correctness and completeness of the search results.
In this paper, a vector space model is used and every doc-
ument is represented by a vector, which means every docu-
ment can be seen as a point in a high dimensional space.
Due to the relationship between different documents, all the
documents can be divided into several categories. In other words, the points whose distances are short in the high-dimensional space can be grouped into a specific category. The search time can be largely reduced by selecting the desired category and abandoning the irrelevant categories. Compared with all the documents in the dataset, the number of documents that the user is interested in is very small. Due to this small number of desired documents, a specific category can be further divided into several sub-categories. Instead of using the traditional sequential search method, a backtracking algorithm is proposed to search for the target documents. The cloud server first searches the categories and finds the smallest desired sub-category. Then the cloud server selects the desired k documents from this smallest sub-category. The value of k is decided in advance by the user and sent to the cloud server. If the current sub-category cannot supply k documents, the cloud server traces back to its parent and selects the desired documents from its sibling categories. This process is executed recursively until the desired k documents are found or the root is reached. To verify the integrity of the search result, a verifiable structure based on a hash function is constructed. Every document is hashed and the hash result is used to represent the document. The hash results of documents are hashed again together with the information of the category these documents belong to, and the result is used to represent the current category. Similarly, every category is represented by the hash of the combination of the current category information and its sub-categories' information. A virtual root is constructed to represent all the data and categories. The virtual root is denoted by the hash of the concatenation of all the categories located in the first level. The virtual root is signed so that it is verifiable. To verify the search result, the user only needs to verify the virtual root instead of verifying every document.
2 EXISTING SOLUTIONS
In recent years, searchable encryption, which provides text search functionality over encrypted data, has been widely studied, especially with respect to security definitions, formalizations and efficiency improvements, e.g. [2], [3], [4], [5], [6], [7]. As shown in Fig. 1, the proposed method is compared with existing solutions and has the advantage of maintaining the relationship between documents.
2.1 Single Keyword Searchable Encryption
Song et al. [2] first introduced the notion of searchable encryption. They propose to encrypt each word in the document independently. This method has a high search cost due to scanning the whole data collection word by word. Goh [8] formally defined a secure index structure and formulated a security model for indexes known as semantic security against adaptive chosen keyword attack (IND-CKA). He also developed an efficient IND-CKA secure index construction called z-idx using pseudo-random functions and Bloom filters. Cash et al. [41] recently designed and implemented an efficient data structure. Due to the lack of a ranking mechanism, users have to spend a long time selecting what they want when massive numbers of documents contain the query keyword. Thus, order-preserving techniques are utilized to realize the ranking mechanism, e.g. [9], [10], [11]. Wang et al. [12] use an encrypted inverted index to achieve secure ranked keyword search over encrypted documents. In the search phase, the cloud server computes the relevance score between documents and the query. In this way, relevant documents are ranked according to their relevance scores and users can get the top-k results. In the public key setting, Boneh et al. [3] designed the first searchable encryption construction, where anyone can use the public key to write to the data stored on the server but only authorized users owning the private key can search. However, all the above mentioned techniques only support single keyword search.
2.2 Multiple Keyword Searchable Encryption
To enrich search predicates, a variety of conjunctive keyword search methods (e.g. [7], [13], [14], [15], [16]) have been proposed. These methods incur large overhead, such as communication cost caused by secret sharing, e.g. [14], or computational cost caused by bilinear maps, e.g. [7]. Pang et al. [17] propose a secure search scheme based on the vector space model. Due to the lack of a security analysis for frequency information and of practical search performance results, it is unclear whether their scheme is secure and efficient or not. Cao et al. [18] present a novel architecture to solve the problem of multi-keyword ranked search over encrypted cloud data. However, the search time of this method grows exponentially with the size of the document collection. Sun et al. [19] give a new architecture which achieves better search efficiency. However, during the index building process, the relevance between documents is ignored. As a result, the relevance among plaintexts is concealed by the encryption and users' expectations cannot be fulfilled well. For example, given a query containing "Mobile" and "Phone", only the documents containing both keywords will be retrieved by traditional methods. But if the semantic relationship between documents is taken into consideration, the documents containing "Cell" and "Phone" should also be retrieved. Obviously, the second result better meets the user's expectation.
2.3 Verifiable Search Based on Authenticated Index
The idea of data verification has been well studied in the area of databases. In the plaintext database scenario, a variety of methods have been proposed, e.g. [20], [21], [22]. Most of these works are based on the original work by Merkle [23], [24] and the refinements by Naor and Nissim [25] for certificate revocation. Merkle hash tree and cryptographic signature techniques are used to construct an authenticated tree structure upon which end users can verify the correctness and completeness of the query results.
Fig. 1. Architecture of ciphertext search.
Pang and Mouratidis [26] apply the Merkle hash tree based authenticated structure to text search engines. However, they only focus on verification-specific issues, ignoring the privacy-preserving search capabilities that are addressed in this paper.
A hash chain is used by Wang et al. [9] to construct a result verification scheme for single keyword search. Sun et al. [19] use the Merkle hash tree and cryptographic signatures to create a verifiable MDB-tree. However, their work cannot be directly used in our architecture, which is oriented toward privacy-preserving multi-keyword search. Thus, a proper mechanism that can be used to verify search results in the big data scenario is essential to both CSPs and end users.
3 OUR CONTRIBUTION
In this paper, we propose a multi-keyword ranked search over encrypted data based on a hierarchical clustering index (MRSE-HCI) to maintain the close relationship between different plain documents over the encrypted domain in order to enhance search efficiency. In the proposed architecture, the search time grows linearly while the size of the data collection grows exponentially. We derive this idea from the observation that users' retrieval needs usually concentrate on a specific field. We can therefore speed up the search process by computing relevance scores only between the query and the documents that belong to the same field as the query. As a result, only documents classified into the field specified by the user's query are evaluated to get their relevance scores. Because the irrelevant fields are ignored, the search speed is enhanced.
We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve this problem. According to the proposed clustering method, every document is dynamically classified into a specific cluster which has a constraint on the minimum relevance score between the different documents it contains. The relevance score is a metric used to evaluate the relationship between different documents. When new documents are added to a cluster, the constraint on the cluster may be broken. If one of the new documents breaks the constraint, a new cluster center is added and the current document is chosen as a temporary cluster center. Then all the documents are reassigned and all the cluster centers are re-elected. Therefore, the number of clusters depends on the number of documents in the dataset and on the close relationship between different plain documents. In other words, the cluster centers are created dynamically and the number of clusters is decided by the properties of the dataset.
We propose a hierarchical method in order to get a better clustering result for a large data collection. The size of each cluster is controlled as a trade-off between clustering accuracy and query efficiency. In the proposed method, the number of clusters and the minimum relevance score increase with the level, whereas the maximum size of a cluster decreases. Depending on the required granularity, the maximum size of a cluster is set at each level. Every cluster needs to satisfy these constraints. If there is a cluster whose size exceeds the limitation, this cluster is divided into several sub-clusters.
We design a search strategy to improve rank privacy. In the search phase, the cloud server first computes the relevance scores between the query and the cluster centers of the first level and then chooses the nearest cluster. This process is iterated to get the nearest child cluster until the smallest cluster is found. The cloud server then computes the relevance scores between the query and the documents included in the smallest cluster. If the smallest cluster cannot supply the number of desired documents, which is decided in advance by the user, the cloud server traces back to the parent cluster of the smallest cluster and searches its sibling clusters. This process is iterated until the number of desired documents is reached or the root is reached. Due to this special search procedure, the rankings of documents in the search results differ from the rankings derived from the traditional sequential search. Therefore, the rank privacy is enhanced.
Part of the above work has been presented in [27]. As a further improvement, in this paper we also construct a verifiable tree structure upon the hierarchical clustering method to verify the integrity of the search result. This authenticated tree structure mainly takes advantage of the Merkle hash tree and cryptographic signatures. Every document is hashed and the hash result is used as the representative of the document. The smallest cluster is represented by the hash of the concatenation of the documents included in the smallest cluster and its own category information. A parent cluster is represented by the hash of the concatenation of its children and its own category information. A virtual root is added and represented by the hash of the concatenation of the categories located in the first level. In addition, the virtual root is signed so that the user can verify the search result by verifying the virtual root.
In short, our contributions can be summarized as follows:
1) We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve this problem.
2) We propose the MRSE-HCI architecture to speed up the server-side search phase. With the exponential growth of the document collection, the search time grows linearly instead of exponentially.
3) We design a search strategy to improve rank privacy. This search strategy adopts a backtracking algorithm on top of the above clustering method. As the data volume grows, the advantage of the proposed method in rank privacy becomes more apparent.
4) By applying the Merkle hash tree and cryptographic signatures to an authenticated tree structure, we provide a verification mechanism to ensure the correctness and completeness of search results.
The rest of the paper is organized as follows: Section 4 describes the system model, threat model, design goals and notations. The architecture and detailed algorithms are presented in Section 5. We discuss the efficiency and security of the MRSE-HCI scheme in Section 6. An evaluation method is provided in Section 7. Section 8 demonstrates the results of our experiments. Section 9 concludes the paper.
4 DEFINITION AND BACKGROUND
4.1 System Model
The system model contains three entities, as illustrated in Fig. 1: the data owner, the data user, and the cloud server. The box with dashed lines in the figure indicates the component added to the existing architecture.
The data owner is responsible for collecting documents, building the document index, and outsourcing them in an encrypted format to the cloud server. Apart from that, the data user needs to obtain authorization from the data owner before accessing the data. The cloud server provides a huge storage space and the computation resources needed by ciphertext search. Upon receiving a legal request from the data user, the cloud server searches the encrypted index and sends back the top-k documents that are most likely to match the user's query [11]. The number k is properly chosen by the data user. Our system aims at protecting data from leaking information to the cloud server while improving the efficiency of ciphertext search.
In this model, both the data owner and the data user are trusted, while the cloud server is semi-trusted, which is consistent with the architecture in [9], [18], [28]. In other words, the cloud server will strictly follow the prescribed protocol but will try to learn more information about the data and the index.
4.2 Threat Model
The adversary's ability can be summarized in two threat models.
Known ciphertext model. In this model, the cloud server can obtain the encrypted document collection, the encrypted data index, and the encrypted query keywords.
Known background model. In this model, the cloud server knows more information than in the known ciphertext model. Statistical background information about the dataset, such as the document frequency and term frequency of a specific keyword, can be used by the cloud server to launch a statistical attack to infer or identify specific keywords in the query [9], [10], which further reveals the plaintext content of documents.
4.3 Design Goals
• Search efficiency. The search time complexity of the MRSE-HCI scheme needs to be logarithmic in the size of the data collection in order to deal with the explosive growth of document volume in the big data scenario.
• Retrieval accuracy. Retrieval precision is related to two factors: the relevance between the query and the documents in the result set, and the relevance among the documents in the result set.
• Integrity of the search result. The integrity of the search results includes three aspects:
1) Correctness. All the documents returned from the server were originally uploaded by the data owner and remain unmodified.
2) Completeness. No qualified documents are omitted from the search results.
3) Freshness. The returned documents are the latest versions of the documents in the dataset.
• Privacy requirements. We set a series of privacy requirements on which current researchers mostly focus.
1) Data privacy. Data privacy represents the confidentiality and privacy of documents. The adversary cannot get the plaintext of documents stored on the cloud server if data privacy is guaranteed. Symmetric cryptography is a conventional way to achieve data privacy.
2) Index privacy. Index privacy means the ability to frustrate the adversary's attempts to steal the information stored in the index. Such information includes keywords, the term frequency (TF) of keywords in documents, the topic of documents, and so on.
3) Keyword privacy. It is important to protect users' query keywords. A secure query generation algorithm should output trapdoors that leak no information about the query keywords.
4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from a query is different, even for the same query. It can be realized by integrating a random function into the trapdoor generation process. If the adversary can deduce that a certain set of trapdoors all correspond to the same keyword, he can calculate the frequency of this keyword in search requests over a certain period. Combined with the document frequency of keywords in the known background model, he/she can use a statistical attack to identify the plain keyword behind these trapdoors.
5) Rank privacy. The rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank orders of different search results and further identify the search keyword.
4.4 Notations
In this paper, notations presented in Table 1 are used.
5 ARCHITECTURE AND ALGORITHM
5.1 System Model
In this section, we introduce the MRSE-HCI scheme. The vector space model adopted by the MRSE-HCI scheme is the same as that of MRSE [18], while the process of building the index is totally different. A hierarchical index structure is introduced into MRSE-HCI instead of a sequential index.
In MRSE-HCI, every document is indexed by a vector. Every dimension of the vector stands for a keyword and its value represents whether the keyword appears in the document. Similarly, the query is also represented by a vector. In the search phase, the cloud server calculates the relevance scores between the query and the documents by computing the inner product of the query vector and the document vectors, and returns the target documents to the user according to the top-k relevance scores.
Due to the fact that all the documents outsourced to the cloud server are encrypted, the semantic relationship between the plain documents is lost over the encrypted documents. In order to maintain this semantic relationship, a clustering method is used to cluster the documents by clustering their related index vectors. Every document vector is viewed as a point in the n-dimensional space. With the lengths of the vectors normalized, the distance between points in the n-dimensional space reflects the relevance of the corresponding documents. In other words, the points of highly relevant documents are very close to each other in the n-dimensional space. As a result, we can cluster the documents based on this distance measure.
As the volume of data in the data center experiences dramatic growth, the conventional sequential search approach becomes very inefficient. To improve the search efficiency, a hierarchical clustering method is proposed. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold at different levels, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is reached. Upon receiving a legal request, the cloud server searches the related indexes layer by layer instead of scanning all indexes.
5.2 MRSE-HCI Architecture
The MRSE-HCI architecture is depicted in Fig. 2. The data owner builds the encrypted index based on the dictionary, random numbers and the secret key; the data user submits a query to the cloud server to get the desired documents; and the cloud server returns the target documents to the data user. This architecture mainly consists of the following algorithms.
• Keygen(1^{l(n)}) → (sk, k). It is used to generate the secret keys for encrypting the index and the documents.
• Index(D, sk) → I. The encrypted index is generated in this phase by using the above-mentioned secret key. The clustering process is also included in this phase.
• Enc(D, k) → E. The document collection is encrypted by a symmetric encryption algorithm which achieves semantic security.
• Trapdoor(w, sk) → T_w. It generates the encrypted query vector T_w from the user's input keywords and the secret key.
• Search(T_w, I, k_top) → (I_w, E_w). In this phase, the cloud server compares the trapdoor with the index to get the top-k retrieval results.
• Dec(E_w, k) → F_w. The returned encrypted documents are decrypted with the key generated in the first step.
The concrete functions of the different components are described below.
1) Keygen(1^{l(n)}). The data owner randomly generates an (n + u + 1)-bit vector S, where every element is the integer 0 or 1, and two invertible (n + u + 1) × (n + u + 1) matrices whose elements are random integers; together they form the secret key sk. The secret key k is generated by the data owner by choosing an n-bit pseudo-random sequence.
2) Index(D, sk). As shown in Fig. 3, the data owner uses a tokenizer and a parser to analyze every document and extract all keywords. The data owner then uses the dictionary D_W to transform the documents into a collection of document vectors DV, and calculates DC and CCV by using the quality hierarchical clustering (QHC) method illustrated in Section 5.4. After that, the data owner applies the dimension-expanding and vector-splitting procedure to every document vector. It is worth noting that CCV is treated in the same way as DV. For dimension-expanding, every vector in DV is extended to (n + u + 1) dimensions, where the value of dimension n + j (1 ≤ j ≤ u) is a randomly generated integer and the last dimension is set to 1. For vector-splitting, every extended document vector V is split into two (n + u + 1)-dimensional vectors, V' and V'', with the help of the (n + u + 1)-bit vector S as a splitting indicator: if the ith element of S (S_i) is 0, then V'_i = V''_i = V_i; if the ith element of S (S_i) is 1, then V''_i is set to a random number and V'_i = V_i - V''_i. Finally, the traditional index is encrypted as I_d = {M_1^T V', M_2^T V''} by matrix multiplication with sk, and I_c is generated in a similar way. After this, I_d, I_c, and DC are outsourced to the cloud server.

TABLE 1
Notations

d_i      The ith document vector, denoted as d_i = {d_{i,1}, ..., d_{i,n}}, where d_{i,j} represents whether the jth keyword in the dictionary appears in document d_i.
m        The number of documents in the data collection.
n        The size of the dictionary D_W.
CCV      The collection of cluster center vectors, denoted as CCV = {c_1, ..., c_n}, where c_i is the average vector of all document vectors in the cluster.
CCV_i    The collection of the ith-level cluster center vectors, denoted as CCV_i = {v_{i,1}, ..., v_{i,n}}, where v_{i,j} represents the jth vector in the ith level.
DC       The document classification information, such as the document id list of a certain cluster.
DV       The collection of document vectors, denoted as DV = {d_1, d_2, ..., d_m}.
D_W      The dictionary, denoted as D_W = {w_1, w_2, ..., w_n}.
F_w      The ranked id list of all documents according to their relevance to keyword w.
I_c      The clustering index, which contains the encrypted vectors of the cluster centers.
I_d      The traditional index, which contains the encrypted document vectors.
L_i      The minimum relevance score between different documents in the ith level of a cluster.
QV       The query vector.
TH       The fixed maximum number of documents in a cluster.
T_w      The encrypted query vector for the user's query.

Fig. 2. MRSE-HCI architecture.

Fig. 3. Algorithm index.
3) Enc(D, k). The data owner adopts a secure symmetric encryption algorithm (e.g., AES) to encrypt the plain document set D and outsources it to the cloud server.
4) Trapdoor(w, sk). The data user sends the query to the data owner, who analyzes the query and builds the query vector QV from the query keywords with the help of the dictionary D_W. QV is then extended to an (n + u + 1)-dimensional query vector. Subsequently, v random positions chosen from the range (n, n + u] in QV are set to 1 and the others are set to 0. The value of the last dimension of QV is set to a random number t ∈ [0, 1]. Then the first (n + u) dimensions of QV, denoted as q_w, are scaled by a random number r (r ≠ 0), giving Q_w = (r · q_w, t). After that, Q_w is split into two random vectors {Q'_w, Q''_w} with a vector-splitting procedure similar to that in the Index(D, sk) phase. The difference is that if the ith bit of S is 1, then q'_i = q''_i = q_i; if the ith bit of S is 0, q'_i is set to a random number and q''_i = q_i - q'_i. Finally, the encrypted query vector T_w is generated as T_w = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} and sent back to the data user.
5) Search(T_w, I, k_top). Upon receiving T_w from the data user, the cloud server computes the relevance scores between T_w and the cluster index I_c and then chooses the matched cluster which has the highest relevance score. For every document contained in the matched cluster, the cloud server extracts its corresponding encrypted document vector from I_d and calculates its relevance score S with T_w, as described in Equation (1). Finally, the scores of the documents in the matched cluster are sorted and the top k_top documents are returned by the cloud server. The details are discussed in Section 5.5.

S = T_w · I_c
  = {M_1^{-1} Q'_w, M_2^{-1} Q''_w} · {M_1^T V', M_2^T V''}
  = Q'_w · V' + Q''_w · V''
  = Q_w · V.      (1)

6) Dec(E_w, k). The data user utilizes the secret key k to decrypt the returned ciphertext E_w.
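To make the interaction among Keygen, Index, Trapdoor and Search concrete, the following Python sketch walks through the secure kNN steps described above on toy vectors. It is a minimal illustration under our own simplifying assumptions: tiny dimensions, a single document vector, floating-point matrices, and the dimension-expanding step folded into the toy vector. The function and variable names are ours, not part of the scheme's formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def keygen(dim):
    """Toy Keygen: a random bit vector S and two invertible matrices (sk)."""
    S = rng.integers(0, 2, size=dim)
    M1 = rng.random((dim, dim)) + np.eye(dim)   # random, almost surely invertible
    M2 = rng.random((dim, dim)) + np.eye(dim)
    return S, M1, M2

def split(vec, S, keep_on):
    """Vector-splitting: where S[i] == keep_on, copy the value to both shares;
    elsewhere split it into two random shares that sum to the original value."""
    v1 = np.empty_like(vec, dtype=float)
    v2 = np.empty_like(vec, dtype=float)
    for i, s in enumerate(S):
        if s == keep_on:
            v1[i] = v2[i] = vec[i]
        else:
            v2[i] = rng.random()
            v1[i] = vec[i] - v2[i]
    return v1, v2

def encrypt_index_vector(V, S, M1, M2):
    # Documents/cluster centers: copy where S[i] == 0, split where S[i] == 1.
    V1, V2 = split(V, S, keep_on=0)
    return M1.T @ V1, M2.T @ V2

def encrypt_query_vector(Q, S, M1, M2):
    # Queries use the complementary rule: copy where S[i] == 1, split where S[i] == 0.
    Q1, Q2 = split(Q, S, keep_on=1)
    return np.linalg.inv(M1) @ Q1, np.linalg.inv(M2) @ Q2

dim = 6                                              # stands in for n + u + 1
V = rng.integers(0, 2, size=dim).astype(float)       # toy document vector
Q = rng.integers(0, 2, size=dim).astype(float)       # toy query vector

S, M1, M2 = keygen(dim)
Id = encrypt_index_vector(V, S, M1, M2)
Tw = encrypt_query_vector(Q, S, M1, M2)

# Equation (1): the server-side score equals the plaintext inner product.
score = Tw[0] @ Id[0] + Tw[1] @ Id[1]
assert np.isclose(score, Q @ V)
print(score, Q @ V)
```

The cancellation works because (M^{-1} Q')·(M^T V') = Q'·V', so the server learns only the inner product, never the plaintext vectors themselves.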
5.3 Relevance Measure
In this paper, the concept of coordinate matching [29] is adopted as the relevance measure. It is used to quantify the relevance between a document and the query, between two documents, and between the query and a cluster center. Equation (2) defines the relevance score between document d_i and query q_w, Equation (3) defines the relevance score between query q_w and cluster center lc_{i,j}, and Equation (4) defines the relevance score between documents d_i and d_j:

S_{q,d_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × d_{i,t})      (2)

S_{q,c_i} = Σ_{t=1}^{n+u+1} (q_{w,t} × lc_{i,j,t})    (3)

S_{d,d_i} = Σ_{t=1}^{n+u+1} (d_{i,t} × d_{j,t}).      (4)
5.4 Quality Hierarchical Clustering Algorithm
So far, many hierarchical clustering methods have been proposed. However, none of these methods is comparable to partition clustering methods in terms of time complexity. K-means [30] and K-medoids [31] are popular partition clustering algorithms, but k is fixed in these two methods, so they cannot be applied to situations with a dynamic number of cluster centers. We therefore propose a quality hierarchical clustering algorithm based on a novel dynamic K-means.
As shown in Fig. 4, in the proposed dynamic K-means algorithm a minimum relevance threshold is defined for the clusters to keep each cluster compact and dense. If the relevance score between a document and its center is smaller than the threshold, a new cluster center is added and all the documents are reassigned. This procedure is iterated until k is stable. Compared with the traditional clustering method, k changes dynamically during the clustering process, which is why it is called the dynamic K-means algorithm.
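The following Python sketch illustrates the idea behind dynamic K-means as just described: whenever a document is not relevant enough to its nearest center, a new center is spawned and all documents are reassigned until k stabilizes. It is a simplified illustration under our own assumptions (the inner product of normalized plaintext vectors stands in for the relevance score, and the helper names are hypothetical); the paper's Fig. 4 remains the authoritative description. The relevance helper defined here is reused by the later sketches.

```python
import numpy as np

def relevance(a, b):
    # Stand-in for the paper's relevance score: inner product of normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def dynamic_kmeans(docs, k=2, min_relevance=0.6, max_iter=50):
    """Sketch of dynamic K-means: k grows whenever some document is not
    relevant enough to its nearest center, then all documents are reassigned."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(0)
    centers = docs[rng.choice(len(docs), size=min(k, len(docs)), replace=False)]
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(max_iter):
        # Assign every document to its most relevant center.
        labels = np.array([max(range(len(centers)),
                               key=lambda c: relevance(d, centers[c])) for d in docs])
        # If the worst-served document violates the minimum relevance constraint,
        # promote it to a new (temporary) cluster center and reassign everything.
        worst = min(range(len(docs)),
                    key=lambda i: relevance(docs[i], centers[labels[i]]))
        if relevance(docs[worst], centers[labels[worst]]) < min_relevance:
            centers = np.vstack([centers, docs[worst]])
            continue
        # Otherwise re-elect each center as the mean of its cluster; stop when stable.
        new_centers = np.vstack([docs[labels == c].mean(axis=0)
                                 if np.any(labels == c) else centers[c]
                                 for c in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```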
Fig. 4. Algorithm dynamic k-means.

Fig. 5. Algorithm quality hierarchical clustering (QHC).

The QHC algorithm is illustrated in Fig. 5 and works as follows. Every cluster is checked to determine whether its size exceeds the maximum number TH. If it does, this "big" cluster is split into child clusters formed by running the dynamic K-means algorithm on the documents of this cluster. This procedure is iterated until all clusters meet the requirement on the maximum cluster size. The clustering procedure is illustrated in Fig. 6. All the documents are denoted as points in a coordinate system. These points are initially partitioned into two clusters by the dynamic K-means algorithm with k = 2. These two bigger clusters are depicted by the ellipses. Then these two clusters are checked to see whether their points satisfy the distance constraint. The second cluster does not meet this requirement, so a new cluster center is added with k = 3 and the dynamic K-means algorithm runs again to partition the second cluster into two parts. Then the data owner checks whether the size of these clusters exceeds the maximum number TH. Cluster 1 is split into two sub-clusters again due to its large size. Finally, all points are clustered into four clusters, as depicted by the rectangles.
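Building on the dynamic K-means sketch above, the following hypothetical helper shows the recursive splitting step of QHC: any cluster larger than TH is re-clustered with dynamic K-means until every cluster satisfies the size constraint, yielding the hierarchy of levels used by the index. It reuses `dynamic_kmeans` and `numpy` from the previous sketch and is only an illustration, not the paper's exact algorithm.

```python
def qhc(docs, doc_ids, TH=100, level=0):
    """Sketch of quality hierarchical clustering (QHC): recursively split any
    cluster whose size exceeds TH by re-running dynamic K-means on its members."""
    docs = np.asarray(docs, dtype=float)
    node = {"center": docs.mean(axis=0), "level": level}
    if len(docs) <= TH:
        node["doc_ids"] = list(doc_ids)            # leaf: cluster is small enough
        return node
    labels, _ = dynamic_kmeans(docs, k=2)          # from the previous sketch
    if len(set(labels.tolist())) == 1:             # degenerate case: force a split
        labels = np.array([i % 2 for i in range(len(docs))])
    node["children"] = [
        qhc(docs[labels == c],
            [i for i, keep in zip(doc_ids, labels == c) if keep],
            TH=TH, level=level + 1)
        for c in sorted(set(labels.tolist()))
    ]
    return node
```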
5.5 Search Algorithm
The cloud server needs to find the cluster that best matches the query. With the help of the cluster index I_c and the document classification DC, the cloud server uses an iterative procedure to find the best matched cluster. The following procedure demonstrates how the matched cluster is obtained:
1) The cloud server computes the relevance scores between the query T_w and the encrypted vectors of the first-level cluster centers in the cluster index I_c, then chooses the ith cluster center I_{c,1,i}, which has the highest score.
2) The cloud server gets the child cluster centers of this cluster center, computes the relevance scores between T_w and every encrypted vector of the child cluster centers, and finally gets the cluster center I_{c,2,i} with the highest score. This procedure is iterated until the ultimate cluster center I_{c,l,j} in the last level l is reached.
In the situation depicted in Fig. 7, there are nine documents which are grouped into three clusters. After calculating the relevance scores with the trapdoor T_w, cluster 1, which is shown within the dashed box in Fig. 7, is found to be the best match. Documents d_1, d_3 and d_9 belong to cluster 1, so their encrypted document vectors in I_d are extracted to compute the relevance scores with T_w.
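A minimal sketch of this server-side retrieval logic on the (plaintext) hierarchy produced by the QHC sketch: descend greedily to the most relevant leaf cluster, score its documents, and, as described in Sections 1 and 3, backtrack to the parent so that sibling clusters are also considered when the leaf cannot supply k_top documents. In the real scheme all comparisons are performed on encrypted vectors via Equation (1); the tree layout and helper names here are our own assumptions, and `relevance` comes from the earlier sketch.

```python
def collect_doc_ids(node):
    """All document ids stored in the sub-tree rooted at `node`."""
    if "doc_ids" in node:
        return list(node["doc_ids"])
    return [i for child in node["children"] for i in collect_doc_ids(child)]

def search_tree(root, query, k_top, doc_vectors):
    """Greedy descent to the most relevant leaf cluster, with backtracking to the
    parent (and thus the sibling clusters) when the leaf holds fewer than k_top docs."""
    path = [root]
    while "children" in path[-1]:
        best = max(path[-1]["children"],
                   key=lambda child: relevance(query, child["center"]))
        path.append(best)
    candidates = collect_doc_ids(path[-1])
    while len(candidates) < k_top and len(path) > 1:
        path.pop()                                  # backtrack to the parent cluster
        candidates = collect_doc_ids(path[-1])      # the parent covers the siblings too
    ranked = sorted(candidates,
                    key=lambda i: relevance(query, doc_vectors[i]), reverse=True)
    return ranked[:k_top]
```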
5.6 Search Result Verification
The retrieved data have a high possibility of being wrong, since the network is unstable and the data may be damaged due to hardware/software failure, a malicious administrator, or an intruder. Verifying the authenticity of search results is therefore emerging as a critical issue in the cloud environment. We therefore design a signed hash tree to verify the correctness and freshness of the search results.
• Building. The data owner builds the hash tree based on the hierarchical index structure. The algorithm shown in Fig. 8 is described as follows. The hash value of a leaf node of the tree is h(id || version || F(id)), where id denotes the document id, version denotes the document version and F(id) denotes the document contents. The value of a non-leaf node is a pair (id, h(id || h_child)), where id denotes the value of the cluster center or document vector in the encrypted index, and h_child is the hash value of its child nodes. The hash value of the tree root node is based on the hash values of all clusters in the first level. It is worth noting that the root node denotes the data set which contains all clusters. Then the data owner generates the signature of the hash value of the root node and outsources the hash tree, including the root signature, to the cloud server. A cryptographic signature s (e.g., an RSA or DSA signature) can be used here to authenticate the hash value of the root node.
• Processing. Using the algorithm shown in Fig. 9, the cloud server returns the root signature and the minimum hash sub-tree (MHST) to the client. The minimum hash sub-tree includes the hash values of the leaf nodes in the matched cluster and of the non-leaf nodes corresponding to all cluster centers used to find the matched cluster in the search phase. For example, in Fig. 10, the search result is documents D, E and F. Then the leaf nodes are D, E, F and G, and the non-leaf nodes include C1, C2, C3, C4, d_D, d_E, d_F and d_G. In addition, the root is included among the non-leaf nodes.
• Verifying. The data user uses the minimum hash sub-tree to re-compute the hash values of the nodes, in particular the root node, which can be further verified by the root signature. If all nodes match, the correctness and freshness are guaranteed. Then the data user re-searches the index constructed from the retrieved values in the MHST. If the search result is the same as the retrieved result, the completeness, correctness and freshness are all guaranteed.

Fig. 6. Clustering process.

Fig. 7. Retrieval process.

Fig. 8. Algorithm building-minimum hash sub-tree.

Fig. 9. Algorithm processing-minimum hash sub-tree.
As shown in Fig. 10, in the building phase, all documents are clustered into two big clusters and four small clusters, and each big cluster contains two small clusters. The hash value of leaf node A is h(id_A || version || F(id_A)), the value of the non-leaf node C3 is (id_{C3}, h(id_{C3} || h_A || h_B || h_C)), and the value of the non-leaf node C1 is (id_{C1}, h(id_{C1} || h_{C3} || h_{C4})). The other values of leaf nodes and non-leaf nodes are generated similarly. In order to combine all first-level clusters into a tree, a virtual root node is created by the data owner with hash value h(h_{C1,2} || h_{C2,2}), where C1,2 and C2,2 denote the second part of cluster centers 1 and 2, respectively. Then the data owner signs the root node, e.g., s(h(h_{C1,2} || h_{C2,2})) = (h_{C1,2} || h_{C2,2}, e(h(h_{C1,2} || h_{C2,2}))^k, g), and outsources it to the cloud server.
In the processing phase, suppose that cluster C4 is the matched cluster and the returned top-three documents are D, E, and F. Then the minimum hash sub-tree includes the hash values of the nodes D, E, F, d_D, d_E, d_F, d_G, C3, C2, C1, C4 and the signed root s(h(h_{C1,2} || h_{C2,2})).
In the verifying phase, upon receiving the signed root, the data user first checks whether e(h(h_{C1,2} || h_{C2,2}), g)^k = e(sig_k(h(h_{C1,2} || h_{C2,2})), g). If this does not hold, the retrieved hash tree is not authentic; otherwise the returned nodes D, E, F, d_D, d_E, d_F, d_G, C3, C2, C1 and C4 work together to verify each other and reconstruct the hash tree. If all the nodes are authentic, the returned hash tree is authentic. Then the data user re-computes the hash values of the leaf nodes D, E and F by using the returned documents. These newly generated hash values are compared with the corresponding returned hash values. If there is no difference, the retrieved documents are correct. Finally, the data user uses the trapdoor to re-search the index constructed from the first parts of the retrieved nodes. If the search result is the same as the retrieved result, the search result is complete.
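The following Python sketch mirrors the hash-tree construction just described: leaves hash (id, version, contents), internal nodes hash their id together with their children's hashes, and a signed virtual root covers the first-level clusters. The signature here is only a placeholder (an HMAC under an assumed owner key) standing in for the RSA/DSA signature mentioned above, the intermediate document-vector layer is collapsed for brevity, and the structure and helper names are our own illustration.

```python
import hashlib, hmac

def h(*parts):
    """Hash of the concatenation of the given byte strings."""
    return hashlib.sha256(b"||".join(parts)).hexdigest().encode()

def leaf_hash(doc_id, version, contents):
    # h(id || version || F(id)) for a document leaf.
    return h(doc_id.encode(), str(version).encode(), contents)

def node_hash(node_id, child_hashes):
    # (id, h(id || h_child)) for a cluster node.
    return node_id.encode(), h(node_id.encode(), *child_hashes)

def sign_root(root_hash, owner_key):
    # Placeholder for the cryptographic signature on the virtual root.
    return hmac.new(owner_key, root_hash, hashlib.sha256).hexdigest()

# Toy data set: two first-level clusters, each holding one document.
owner_key = b"data-owner-secret"
docs = {"A": (1, b"contents of A"), "B": (3, b"contents of B")}

hA = leaf_hash("A", *docs["A"])
hB = leaf_hash("B", *docs["B"])
_, hC1 = node_hash("C1", [hA])
_, hC2 = node_hash("C2", [hB])
root_hash = h(hC1, hC2)                        # virtual root over first-level clusters
root_sig = sign_root(root_hash, owner_key)

# Verification: the user recomputes the path hashes and checks the root signature.
hA_check = leaf_hash("A", *docs["A"])          # recomputed from the returned document
_, hC1_check = node_hash("C1", [hA_check])
assert h(hC1_check, hC2) == root_hash          # hC2 is supplied in the minimum sub-tree
assert hmac.compare_digest(sign_root(root_hash, owner_key), root_sig)
```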
5.7 Dynamic Data Collection
As the documents stored at the server may be deleted or modified and new documents may be added to the original data collection, a mechanism which supports dynamic data collections is necessary. A naive way to address these problems is to download all documents and the index locally and then update the data collection and index. However, this method incurs a huge cost in bandwidth and local storage space.
To avoid updating the index frequently, we provide a practical strategy to deal with insertion, deletion and modification operations. Without loss of generality, we use the following examples to illustrate how the strategy works. The data owner preserves many empty entries in the dictionary for new documents. If a new document contains new keywords, the data owner first adds these new keywords to the dictionary and then constructs a document vector based on the new dictionary. The data owner sends the trapdoor generated from the document vector, the encrypted document and the encrypted document vector to the cloud server. The cloud server finds the closest cluster, and puts the encrypted document and the encrypted document vector into it.
As every cluster has a constraint on its maximum size, it is possible that the number of documents in a cluster exceeds the limitation due to an insertion operation. In this case, all the encrypted document vectors belonging to the broken cluster are returned to the data owner. After decrypting the retrieved document vectors, the data owner re-builds the sub-index based on the deciphered document vectors. The sub-index is re-encrypted and re-outsourced to the cloud server.
Upon receiving a deletion order, the cloud server searches for the target document. Then the cloud server deletes the document and the corresponding document vector. Modifying a document can be described as deleting the old version of the document and inserting the new version. The operation of modifying documents can therefore be realized by combining the insertion and deletion operations.
To deal with the impact of updates on the hash tree, a lazy update strategy is designed. For the insertion operation, the corresponding hash value is calculated and marked as a raw node, while the original nodes in the hash tree are kept unchanged, because the original hash tree still supports document verification for all documents except the new one. Only when the newly added document is accessed is the hash tree updated. A similar concept is used for the deletion operation. The only difference is that the deletion operation does not trigger a hash tree update.
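As a rough illustration of the insertion path described above (hypothetical helper names, plaintext vectors standing in for their encrypted counterparts, reusing `relevance`, `qhc` and `numpy` from the earlier sketches): the server drops the new document into the most relevant leaf cluster, and only when the TH constraint is broken would the data owner re-cluster that sub-index, here simulated by re-running the QHC sketch.

```python
def insert_document(root, doc_id, doc_vector, doc_vectors, TH=100):
    """Sketch: route a new document to the most relevant leaf cluster; if that
    cluster overflows TH, the data owner rebuilds the sub-index (here via qhc)."""
    doc_vectors[doc_id] = doc_vector
    node = root
    while "children" in node:
        node = max(node["children"],
                   key=lambda child: relevance(doc_vector, child["center"]))
    node["doc_ids"].append(doc_id)
    if len(node["doc_ids"]) > TH:                  # maximum-size constraint broken
        members = np.asarray([doc_vectors[i] for i in node["doc_ids"]])
        rebuilt = qhc(members, node["doc_ids"], TH=TH, level=node["level"])
        node.clear()
        node.update(rebuilt)                       # replace the leaf with a sub-tree
    return node
```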
6 EFFICIENCY AND SECURITY
6.1 Search Efficiency
The search process can be divided into the Trapdoor(w, sk) phase and the Search(T_w, I, k_top) phase. The number of operations needed in the Trapdoor(w, sk) phase is given in Equation (5), where n is the number of keywords in the dictionary and w is the number of query keywords:

O(MRSE-HCI) = 5n + u - v - w + 5.      (5)

Since the time complexity of the Trapdoor(w, sk) phase is independent of DC, it can be described as O(1) even when DC increases exponentially.
The difference in the search process between MRSE-HCI and MRSE is the retrieval algorithm used in this phase. In the Search(T_w, I, k_top) phase of MRSE, the cloud server needs to compute the relevance scores between the encrypted query vector T_w and all encrypted document vectors in I_d, and get the top-k ranked document list F_w. The number of operations needed in the Search(T_w, I, k_top) phase is given in Equation (6), where m represents the number of documents in DC and n represents the number of keywords in the dictionary:

O(MRSE) = 2m(2n + 2u + 1) + m - 1.      (6)

Fig. 10. Authentication for hierarchical clustering index.
However, in the Search(T_w, I, k_top) phase of MRSE-HCI, the cloud server uses the information in DC to quickly locate the matched cluster and only compares T_w with a limited number of encrypted document vectors in I_d. The number of operations needed in the Search(T_w, I, k_top) phase is given in Equation (7), where k_i represents the number of cluster centers to be compared with at the ith level and c represents the number of document vectors in the matched cluster:

O(MRSE-HCI) = (Σ_{i=1}^{l} k_i) · 2(2n + 2u + 1) + c · 2(2n + 2u + 1) + c - 1.      (7)
When DC increases exponentially, m can be set to 2^l. The time complexity of the traditional MRSE is then O(2^l), while the time complexity of the proposed MRSE-HCI is only O(l). The total search time can be calculated as given in Equation (8) below, where O(trapdoor) is O(1) and O(query) depends on DC:

O(searchTime) = O(trapdoor) + O(query).      (8)

In short, when the number of documents in DC grows exponentially, the search time of MRSE-HCI increases linearly while that of the traditional methods increases exponentially.
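As a quick numerical illustration of Equations (6) and (7), the following snippet evaluates both operation counts while the collection size doubles. The parameter values (u, c, and two cluster centers per level) are toy choices of our own, not measurements from the paper; the point is only that the MRSE count doubles along with m, whereas the MRSE-HCI count grows roughly with the number of levels l.

```python
def ops_mrse(m, n, u):
    # Equation (6): sequential scan over all m encrypted document vectors.
    return 2 * m * (2 * n + 2 * u + 1) + m - 1

def ops_mrse_hci(k_per_level, c, n, u):
    # Equation (7): compare against k_i centers per level, then c documents.
    per_vector = 2 * (2 * n + 2 * u + 1)
    return sum(k_per_level) * per_vector + c * per_vector + c - 1

n, u, c = 22157, 500, 200          # toy values of our own choosing
for l in range(10, 16):            # document collection of m = 2^l documents
    m = 2 ** l
    print(l, m, ops_mrse(m, n, u), ops_mrse_hci([2] * l, c, n, u))
```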
6.2 Security Analysis
To express the security analysis briefly, we adopt some concepts from [37], [38], [39] and define what kinds of information are leaked to the honest-but-curious server.
The basic information about documents and queries is inevitably leaked to the honest-but-curious server, since all the data are stored at the server and the queries are submitted to the server. Moreover, the access pattern and search pattern cannot be hidden in MRSE-HCI, as in previous searchable encryption schemes [18], [38], [39], [40].
Definition 1 (Size pattern). Let D be a document collection. The size pattern induced by a q-query is a tuple a(D, Q) = (m, |Q_1|, ..., |Q_q|), where m is the number of documents and |Q_i| is the size of query Q_i.
Definition 2 (Access pattern). Let D be a document collection and I be an index over D. The access pattern induced by a q-query is a tuple b(D, Q) = (I(Q_1), ..., I(Q_q)), where I(Q_i) is the set of identifiers returned by query Q_i, for 1 ≤ i ≤ q.
Definition 3 (Search pattern). Let D be a document collection. The search pattern induced by a q-query is an m × q binary matrix c(D, Q) such that, for 1 ≤ i ≤ m and 1 ≤ j ≤ q, the element in the ith row and jth column is 1 if the document identifier id_i is returned by query Q_j.
Definition 4 (Known ciphertext model secure). Let P = (Keygen, Index, Enc, Trapdoor, Search, Dec) be an index-based MRSE-HCI scheme over dictionary D_W and let n ∈ N be the security parameter. The known ciphertext model security experiment PrivK^{kcm}_{A,P}(n) is described as follows.
1) The adversary submits two document collections D_0 and D_1 with the same length to a challenger.
2) The challenger generates a secret key {sk, k} by running Keygen(1^{l(n)}).
3) The challenger randomly chooses a bit b ∈ {0, 1}, and returns Index(D_b, sk_b) → I_b and Enc(D_b, k_b) → E_b to the adversary.
4) The adversary outputs a bit b'.
5) The output of the experiment is defined to be 1 if b' = b, and 0 otherwise.
We say the MRSE-HCI scheme is secure under the known ciphertext model if for all probabilistic polynomial-time adversaries A there exists a negligible function negl(n) such that

Pr(PrivK^{kcm}_{A,P}(n) = 1) ≤ 1/2 + negl(n).      (9)
Proof. The adversary A distinguishes the document collections by analyzing the secret key, the index and the encrypted document collection. Then we have Equation (10), where Adv(A_D(sk, k)) is the advantage for adversary A to distinguish the secret key from two random matrices and two random strings, Adv(A_D(I)) is the advantage to distinguish the index from a random string, and Adv(A_D(E)) is the advantage to distinguish the encrypted documents from random strings:

Pr(PrivK^{kcm}_{A,P}(n) = 1) = 1/2 + Adv(A_D(sk, k)) + Adv(A_D(I)) + Adv(A_D(E)).      (10)

The elements of the two matrices in the secret key are randomly chosen from {0, 1}^{l(n)}, and the split indicator S and the key k are also chosen uniformly at random from {0, 1}^{l(n)}. Given {0, 1}^{l(n)}, A distinguishes the secret key from two random matrices and two random strings with only negligible probability. Then there exists a negligible function negl_1(n) such that

Adv(A_D(sk, k)) = |Pr(Keygen(1^{l(n)}) → (sk, k)) - Pr(Random → (sk_r, k_r))| ≤ negl_1(n),      (11)

where sk_r denotes two random matrices and a random string, and k_r is a random string. In our scheme, the encryption of the hierarchical index essentially encrypts all the document vectors and cluster center vectors. All the cluster center vectors are treated as document vectors in the encryption phase. Eventually, all the document vectors and cluster center vectors are encrypted by secure kNN. As secure kNN is known plaintext attack (KPA) secure [32], the hierarchical index is secure under the known ciphertext model. Then there exists a negligible function negl_2(n) satisfying

Adv(A_D(I)) = |Pr(Index(D, sk) → I) - Pr(Random → I_r)| ≤ negl_2(n),      (12)

where I_r is a random string.
Since the encryption algorithm used to encrypt D_b is semantically secure, the encrypted documents are secure under the known ciphertext model. Then there exists a negligible function negl_3(n) such that

Adv(A_D(E)) = |Pr(Enc(D, k) → E) - Pr(Random → E_r)| ≤ negl_3(n),      (13)

where E_r is a set of random strings.
According to Equations (10), (11), (12) and (13), we get Equation (14):

Pr(PrivK^{kcm}_{A,P} = 1) ≤ 1/2 + negl_1(n) + negl_2(n) + negl_3(n)      (14)

negl(n) = negl_1(n) + negl_2(n) + negl_3(n)      (15)

Pr(PrivK^{kcm}_{A,P} = 1) ≤ 1/2 + negl(n).      (16)

By combining Equations (14) and (15), we conclude Equation (16). Then we say that MRSE-HCI is secure under the known ciphertext model. □
7 EVALUATION METHOD
7.1 Search Precision
The search precision quantifies the user's satisfaction. Retrieval precision is related to two factors: the relevance between the retrieved documents and the query, and the relevance of the retrieved documents to each other. Equation (17) defines the relevance between the retrieved documents and the query:

P_q = Σ_{i=1}^{k'} S(q_w, d_i) / Σ_{i=1}^{k} S(q_w, d_i).      (17)

Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, q_w represents the query vector, d_i represents a document vector, and S is a function to compute the relevance score between q_w and d_i. Equation (18) defines the relevance among the different retrieved documents:

P_d = Σ_{j=1}^{k'} Σ_{i=1}^{k'} S(d_j, d_i) / Σ_{j=1}^{k} Σ_{i=1}^{k} S(d_j, d_i).      (18)

Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, and both d_i and d_j denote document vectors. Equation (19) combines the relevance between the query and the retrieved documents with the relevance among the documents to quantify the search precision:

Acc = a · P_q + P_d,      (19)

where a is a tradeoff parameter that balances the relevance between the query and the documents against the relevance among the documents. If a is smaller than 1, more emphasis is put on the relevance among the documents; otherwise, on the query keywords. The above evaluation strategies should be based on the same dataset and keywords.
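A small Python sketch of Equations (17)–(19) under our own assumptions: a generic relevance function S (for example the `relevance` helper from the earlier sketch) applied to plaintext vectors, with hypothetical helper and parameter names.

```python
def precision_scores(query, retrieved, baseline, S, a=1.0):
    """P_q (Eq. 17), P_d (Eq. 18) and Acc (Eq. 19); `retrieved` holds the k'
    vectors returned by the evaluated method, `baseline` the k vectors
    returned by plaintext search, and S is the relevance function."""
    Pq = sum(S(query, d) for d in retrieved) / sum(S(query, d) for d in baseline)
    Pd = (sum(S(di, dj) for di in retrieved for dj in retrieved)
          / sum(S(di, dj) for di in baseline for dj in baseline))
    return Pq, Pd, a * Pq + Pd
```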
7.2 Rank Privacy
Rank privacy quantifies the information leakage of the search results. The definition of rank privacy is adopted from [18]. Equation (20) is used to evaluate the rank privacy:

P_k = Σ_{i=1}^{k} p_i / k.      (20)

Here, k denotes the number of top-k retrieved documents and p_i = |c_i' - c_i|, where c_i' is the ranking of document d_i in the retrieved top-k documents, c_i is the actual ranking of document d_i in the data set, and p_i is set to k if it is greater than k. The overall rank privacy measure at point k, denoted as P_k, is defined as the average value of p_i over the documents d_i in the retrieved top-k documents.
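A corresponding sketch of Equation (20), comparing the retrieved ranking against the actual plaintext ranking; the names and list-based interface are our own illustration.

```python
def rank_privacy(retrieved_ids, actual_ranking, k):
    """Equation (20): the average of p_i = |retrieved rank - actual rank|,
    with each p_i capped at k, over the retrieved top-k documents."""
    actual_pos = {doc_id: rank for rank, doc_id in enumerate(actual_ranking)}
    p = [min(abs(rank - actual_pos[doc_id]), k)
         for rank, doc_id in enumerate(retrieved_ids[:k])]
    return sum(p) / k
```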
8 PERFORMANCE ANALYSIS
In order to test the performance of MRSE-HCI on a real dataset, we built an experimental platform to test the search efficiency, accuracy and rank privacy. We implemented the experiments on a distributed platform which includes three ThinkServer RD830 machines and a ThinkCenter M8400t. The data set is built from IEEE Xplore and includes about 51,000 documents and 22,000 keywords.
According to the notations defined in Section 4, n denotes the dictionary size, k denotes the number of top-k documents, m denotes the number of documents in the data set, and w denotes the number of keywords in the user's query.
Fig. 11 describes the search efficiency under different conditions. Fig. 11a describes the search efficiency for different document set sizes with the dictionary size, the number of retrieved documents and the number of query keywords unchanged (n = 22,157, k = 20, w = 5). In Fig. 11b, we adjust the value of k with the dictionary size, document set size and number of query keywords unchanged (n = 22,157, m = 51,312, w = 5). Fig. 11c tests different numbers of query keywords with the dictionary size, document set size and number of retrieved documents unchanged (n = 22,157, m = 51,312, k = 20).
From Fig. 11a, we can observe that with the exponential growth of the document set size, the search time of MRSE increases exponentially, while the search time of MRSE-HCI increases linearly. As Figs. 11b and 11c show, the search time of MRSE-HCI remains stable as the number of query keywords and retrieved documents increases. Meanwhile, the search time is far below that of MRSE.
Fig. 11. Search efficiency.
Fig. 12 describes the search accuracy, using plaintext search as a baseline. Fig. 12a illustrates the relevance among retrieved documents. As the number of documents increases from 3,200 to 51,200, the ratio of MRSE to plaintext search fluctuates around 1, while the ratio of MRSE-HCI to plaintext search increases from 1.5 to 2. From Fig. 12a, we can observe that the relevance among retrieved documents in MRSE-HCI is almost twice that in MRSE, which means the retrieved documents generated by MRSE-HCI are much closer to each other. Fig. 12b shows the relevance between the query and the retrieved documents. As the size of the document set increases from 3,200 to 51,200, the MRSE-to-plaintext-search ratio fluctuates around 0.75, while the MRSE-HCI-to-plaintext-search ratio increases from 0.65 to 0.75 with the growth of the document set size. From Fig. 12b, we can see that the relevance between the query and the retrieved documents in MRSE-HCI is slightly lower than that in MRSE. This gap narrows as the data size increases, since a big document data set has a clear category distribution, which improves the relevance between the query and the documents. Fig. 12c shows the rank accuracy according to Equation (19). The tradeoff parameter a is set to 1, which means there is no bias towards the relevance among documents or the relevance between documents and the query. From the result, we can conclude that MRSE-HCI is better than MRSE in rank accuracy.
Fig. 13 describes the rank privacy according to Equation (20). In this test, regardless of the number of retrieved documents, MRSE-HCI has better rank privacy than MRSE. This is mainly caused by the relevance of documents being introduced into the search strategy.
9 CONCLUSION
In this paper, we investigated ciphertext search in the scenario of cloud storage. We explored the problem of maintaining the semantic relationship between different plain documents over the related encrypted documents and gave a design method to enhance the performance of semantic search. We also proposed the MRSE-HCI architecture to adapt to the requirements of data explosion, online information retrieval and semantic search. At the same time, a verifiable mechanism was also proposed to guarantee the correctness and completeness of search results. In addition, we analyzed the search efficiency and security under two popular threat models. An experimental platform was built to evaluate the search efficiency, accuracy, and rank privacy. The experimental results show that the proposed architecture not only properly solves the multi-keyword ranked search problem, but also brings an improvement in search efficiency, rank privacy, and the relevance among retrieved documents.
ACKNOWLEDGMENTS
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Nos. XDA06040601 and XDA06010701), the Xinjiang Uygur Autonomous Region science and technology plan (No. 201230121), and the National High Technology Research and Development Program of China (No. 2013AA01A24). An early version of this paper was presented at the Workshop on Security and Privacy in Big Data at IEEE INFOCOM 2014 [27]. Extensive enhancements have been made, including a novel verification scheme that helps the data user verify the authenticity of the search results, a security analysis, and more details of the proposed scheme.
REFERENCES
[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, “Security anal-
ysis of authentication protocols for next-generation mobile and
CE cloud services,” in Proc. IEEE Int. Conf. Consumer Electron.,
Berlin, Germany, 2011, pp. 83–87.
[2] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for
searches on encrypted data,” in Proc. IEEE Symp. Security Priv.,
Berkeley, CA, 2000, pp. 44–55.
[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public
key encryption with keyword search,” in Proc. EUROCRYPT,
Interlaken, Switzerland, 2004, pp. 506–522.
[4] Y. C. Chang and M. Mitzenmacher, “Privacy preserving key-
word searches on remote encrypted data,” in Proc. 3rd Int.
Conf. Applied Cryptography Netw. Security, New York, NY, 2005,
pp. 442–455.
[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable
symmetric encryption: improved definitions and efficient
constructions,” in Proc. 13th ACM Conf. Comput. Commun. Security,
Alexandria, Virginia, 2006, pp. 79–88.
[6] M. Bellare, A. Boldyreva, and A. O’Neill, “Deterministic and effi-
ciently searchable encryption,” in Proc. 27th Annu. Int. Cryptol.
Conf. Adv. Cryptol., Santa Barbara, CA, 2007, pp. 535–552.
[7] D. Boneh and B. Waters, “Conjunctive, subset, and range queries
on encrypted data,” in Proc. 4th Conf. Theory Cryptography,
Amsterdam, Netherlands, 2007, pp. 535–554.
[8] E.-J. Goh, “Secure indexes,” IACR Cryptology ePrint Archive,
Report 2003/216, 2003.
[9] C. Wang, N. Cao, K. Ren, and W. J. Lou, “Enabling secure and effi-
cient ranked keyword search over outsourced cloud data,” IEEE
Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug.
2012.
Fig. 12. Search precision.
Fig. 13. Rank privacy.
[10] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He,
M. Wu, and D. Oard, “Confidentiality-preserving rank-ordered
search,” in Proc. ACM Workshop Storage Security Survivability,
Alexandria, VA, 2007, pp. 7–12.
[11] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, “Zerber+R: Top-
k retrieval from a confidential index,” in Proc. 12th Int. Conf.
Extending Database Technol.: Adv. Database Technol., Saint
Petersburg, Russia, 2009, pp. 439–449.
[12] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, “Secure ranked key-
word search over encrypted cloud data,” in Proc. IEEE 30th Int.
Conf. Distrib. Comput. Syst., Genova, Italy, 2010, pp. 253–262.
[13] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keyword
search over encrypted data,” in Proc. 2nd Int. Conf. Appl.
Cryptography Netw. Security, Yellow Mountain, China, 2004, pp. 31–45.
[14] L. Ballard, S. Kamara, and F. Monrose, “Achieving efficient con-
junctive keyword searches over encrypted data,” in Proc. 7th Int.
Conf. Inform. Commun. Security, Beijing, China, 2005, pp. 414–426.
[15] R. Brinkman, “Searching in encrypted data,” PhD thesis,
University of Twente, 2007.
[16] Y. H. Hwang and P. J. Lee, “Public key encryption with conjunc-
tive keyword search and its extension to a multi-user system,” in
Proc. 1st Int. Conf. Pairing-Based Cryptography, Tokyo, Japan,
2007, pp. 2–22.
[17] H. Pang, J. Shen, and R. Krishnan, “Privacy-preserving similarity-
based text retrieval,” ACM Trans. Internet Technol., vol. 10, no. 1,
pp. 39, Feb. 2010.
[18] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, “Privacy-preserving
multi-keyword ranked search over encrypted cloud data,” in Proc.
IEEE INFOCOM, Shanghai, China, 2011, pp. 829–837.
[19] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li,
“Privacy-preserving multi-keyword text search in the cloud sup-
porting similarity-based ranking,” in Proc. 8th ACM SIGSAC
Symp. Inform., Comput. Commun. Security, Hangzhou, China, 2013,
pp. 71–82.
[20] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, “Dynamic
authenticated index structures for outsourced databases,” in Proc.
ACM SIGMOD, Chicago, IL, 2006, pp. 121–132.
[21] H. H. Pang and K. L. Tan, “Authenticating query results in edge
computing,” in Proc. 20th Int. Conf. Data Eng., Boston, MA, 2004,
pp. 560–571.
[22] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. G.
Stubblebine, “A general model for authenticated data structures,”
Algorithmica, vol. 39, no. 1, pp. 21–41, May 2004.
[23] R. C. Merkle, “Protocols for public key cryptosystems,” in Proc.
IEEE Symp. Security Priv., Oakland, CA, 1980, pp. 122–134.
[24] R. C. Merkle, “A certified digital signature,” in Proc. Adv. Cryptol.,
1990, vol. 435, pp. 218–238.
[25] M. Naor and K. Nissim, “Certificate revocation and certificate
update,” IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561–570,
Apr. 2000.
[26] H. Pang and K. Mouratidis, “Authenticating the query results
of text search engines,” Proc. VLDB Endow., vol. 1, no. 1,
pp. 126–137, Aug. 2008.
[27] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, “A hierarchical cluster-
ing method for big data oriented ciphertext search,” in Proc. IEEE
INFOCOM, Workshop on Security and Privacy in Big Data, Toronto,
Canada, 2014, pp. 559–564.
[28] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, “Achieving secure, scal-
able, and fine-grained data access control in cloud computing,” in
Proc. IEEE INFOCOM, San Diego, CA, 2010, pp. 1–9.
[29] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes:
Compressing and Indexing Documents and Images, 2nd ed. San
Francisco, CA, USA: Morgan Kaufmann, 1999.
[30] J. MacQueen, “Some methods for classification and analysis of
multivariate observations,” in Proc. Berkeley Symp. Math. Stat.
Prob., California, 1967, p. 14.
[31] Z. X. Huang, “Extensions to the k-means algorithm for clustering
large data sets with categorical values,” Data Min. Knowl. Discov.,
vol. 2, no. 3, pp. 283–304, Sep. 1998.
[32] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “Secure
kNN computation on encrypted databases,” in Proc. ACM SIG-
MOD Int. Conf. Manage. Data, Providence, RI, 2009, pp. 139–152.
[33] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu, “Efficient
multi-keyword ranked query over encrypted data in cloud com-
puting,” Future Gener. Comput. Syst., vol. 30, pp. 179–190, Jan. 2014.
[34] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in
Proc. 41st Annu. ACM Symp. Theory Comput., 2009, pp. 169–178.
[35] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public
key encryption with keyword search,” in Proc. Adv. Cryptol.,
Berlin, Heidelberg, 2004, pp. 506–522.
[36] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M.
Steiner, “Highly-scalable searchable symmetric encryption with
support for Boolean queries,” in Proc. Adv. Cryptol., Berlin, Hei-
delberg, 2013, pp. 353–373.
[37] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchable
symmetric encryption,” in Proc. Conf. Comput. Commun. Secur.,
2012, pp. 965–976.
[38] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable
symmetric encryption: Improved definitions and efficient
constructions,” in Proc. 13th ACM Conf. Comput. Commun. Secur.,
2006, pp. 79–88.
[39] M. Chase and S. Kamara, “Structured encryption and controlled
disclosure,” in Proc. Adv. Cryptol., 2010, pp. 577–594.
[40] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M. C. Rosu,
and M. Steiner, “Dynamic searchable encryption in very large
databases: Data structures and implementation,” in Proc. Netw.
Distrib. Syst. Security Symp., 2014, doi: 10.14722/ndss.2014.23264.
[41] S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner,
“Outsourced symmetric private information retrieval,” in Proc.
ACM SIGSAC Conf. Comput. Commun. Secur., Nov. 2013, pp. 875–888.
Chi Chen received the BS and MS degrees from Shandong University, Jinan, China, in 2000 and 2003, respectively, and the PhD degree from the Institute of Software, Chinese Academy of Sciences, Beijing, China, in 2008. He is an associate research fellow of the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include cloud security and database security. From 2003 to 2011, he was a research apprentice, research assistant, and associate research fellow with the State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences. Since 2012, he has been an associate research fellow with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is a member of the IEEE.
Xiaojie Zhu received the BS degree from the Zhejiang University of Technology, Hangzhou, China, in 2011. He is currently working towards the MS degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security. He is a student member of the IEEE.
Peisong Shen received the BS degree from the University of Science and Technology of China, Hefei, China, in 2012. He is currently working towards the PhD degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security. He is a student member of the IEEE.
Jiankun Hu is a professor and research director of the Cyber Security Lab, School of Engineering and IT, The University of New South Wales, Canberra, Australia. His main research interest is cyber security, with a focus on bio-cryptography and anomaly intrusion detection. He has obtained seven ARC (Australian Research Council) grants and has served on the prestigious Mathematics, Information and Computing Sciences panel of the ARC ERA Evaluation Committee. He is a member of the IEEE.
Song Guo received the PhD degree in computer science from the University of Ottawa, Canada. He is currently a full professor at the School of Computer Science and Engineering, The University of Aizu, Japan. His research interests are mainly in the areas of wireless communication and mobile computing, cyber-physical systems, data center networks, cloud computing and networking, big data, and green computing. He has published over 250 papers in refereed journals and conferences in these areas and received three IEEE/ACM best paper awards. Dr. Guo currently serves as Secretary of the IEEE ComSoc Technical Committee on Satellite and Space Communication (TCSSC) and the Technical Subcommittee on Big Data (TSCBD), Associate Editor of IEEE Transactions on Parallel and Distributed Systems (TPDS) and IEEE Transactions on Emerging Topics in Computing (TETC) for the Computational Networks Track, and on the editorial boards of many others. He has also served on the organizing and technical committees of numerous international conferences and workshops. Dr. Guo is a senior member of the IEEE and the ACM.
Zahir Tari received the degree in mathematics from the University of Science and Technology Houari Boumediene, Bab-Ezzouar, Algeria, in 1984, the master's degree in operational research from the University of Grenoble, Grenoble, France, in 1985, and the PhD degree in computer science from the University of Grenoble, in 1989. He is a professor in distributed systems at RMIT University, Melbourne, Australia. He joined the Database Laboratory at EPFL (Swiss Federal Institute of Technology, 1990-1992) and then moved to QUT (Queensland University of Technology, 1993-1995) and RMIT (Royal Melbourne Institute of Technology, since 1996). He is the head of the DSN (Distributed Systems and Networking) group at the School of Computer Science and IT, where he pursues high-impact research and development in computer science. He leads several research groups that focus on core areas including networking (QoS routing, TCP/IP congestion), distributed systems (performance, security, mobility, reliability), and distributed applications (SCADA, Web/Internet applications, mobile applications). His recent research interests are in performance (in Cloud) and security (in SCADA systems). He regularly publishes in prestigious journals (like IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Web Services, ACM Transactions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.). He co-authored two books (John Wiley) and edited more than 10 books. He has been the program committee chair of several international conferences, including the DOA (Distributed Object and Application Symposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on Data Semantics. He has also been the general chair of more than 12 conferences. He is the recipient of 14 ARC (Australian Research Council) grants. He is a senior member of the IEEE.
Albert Y. Zomaya is currently the chair
professor of high-performance computing and
networking and Australian Research Council
professorial fellow in the School of Information
Technologies, The University of Sydney,
Sydney, Australia. He is also the director of the
Centre for Distributed and High-Performance
Computing, which was established in late 2009.
He is the author/co-author of seven books,
more than 370 papers, and the editor of nine
books and 11 conference proceedings. He is
the editor in chief of the IEEE Transactions on Computers and serves
as an associate editor for 19 leading journals. He is the recipient of
the Meritorious Service Award in 2000 and the Golden Core Recogni-
tion in 2006, both from the IEEE Computer Society. He is a chartered
engineer (CEng), a fellow of the AAAS, the IEEE, the IET (UK), and a
distinguished engineer of the ACM.
  • 1. An Efficient Privacy-Preserving Ranked Keyword Search Method Chi Chen, Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student Member, IEEE, Jiankun Hu, Member, IEEE, Song Guo, Senior Member, IEEE, Zahir Tari, Senior Member, IEEE, and Albert Y. Zomaya, Fellow, IEEE Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preserving. Therefore it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents will be normally concealed in the process of encryption, which will lead to significant search accuracy performance degradation. Also the volume of data in data centers has experienced a dramatic growth. This will make it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval on large volume of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and also to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of cluster is reached. In the search phase, this approach can reach a linear computational complexity against an exponential size increase of document collection. In order to verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this paper. Experiments have been conducted using the collection set built from the IEEE Xplore. The results show that with a sharp increase of documents in the dataset the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents. Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, security Ç 1 INTRODUCTION AS we step into the big data era, terabyte of data are pro- duced world-wide per day. Enterprises and users who own a large amount of data usually choose to outsource their precious data to cloud facility in order to reduce data management cost and storage facility spending. As a result, data volume in cloud storage facilities is experiencing a dramatic increase. Although cloud server providers (CSPs) claim that their cloud service is armed with strong security measures, security and privacy are major obstacles prevent- ing the wider acceptance of cloud computing service [1]. A traditional way to reduce information leakage is data encryption. However, this will make server-side data utili- zation, such as searching on encrypted data, become a very challenging task. In the recent years, researchers have proposed many ciphertext search schemes [34], [35], [36], [37], [43] by incorporating the cryptography techniques. These methods have been proven with provable security, but their methods need massive operations and have high time complexity. Therefore, former methods are not suitable for the big data scenario where data volume is very big and applications require online data processing. In addition, the relationship between documents is concealed in the above methods. 
The relationship between documents represents the properties of the documents and hence maintaining the relationship is vital to fully express a document. For exam- ple, the relationship can be used to express its category. If a document is independent of any other documents except those documents that are related to sports, then it is easy for us to assert this document belongs to the category of the sports. Due to the blind encryption, this important property has been concealed in the traditional methods. Therefore, proposing a method which can maintain and utilize this relationship to speed the search phase is desirable. On the other hand, due to software/hardware failure, and storage corruption, data search results returning to the users may contain damaged data or have been distorted by the malicious administrator or intruder. Thus, a verifiable mechanism should be provided for users to verify the cor- rectness and completeness of the search results. In this paper, a vector space model is used and every doc- ument is represented by a vector, which means every docu- ment can be seen as a point in a high dimensional space. Due to the relationship between different documents, all the C. Chen, X. Zhu and P. Shen is with the State Key Laboratory Of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. E-mail: {chenchi, zhuxiaojie, shenpeisong}@iie.ac.cn. J. Hu is with the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, ACT 2600, Australia. E-mail: J.Hu@adfa.edu.au. S. Guo is with the School of Computer Science and Engineering, The University of Aizu, Japan. E-mail: sguo@u-aizu.ac.jp. Z. Tari is with the School of Computer Science, RMIT University, Australia. E-mail: zahir.tari@rmit.edu.au. A. Y. Zomaya is with the School of Information Technologies, The University of Sydney, Australia. E-mail: albert.zomaya@sydney.edu.au. Manuscript received 29 Sept. 2014; revised 8 Apr. 2015; accepted 8 Apr. 2015. Date of publication 21 Apr. 2015; date of current version 16 Mar. 2016. Recommended for acceptance by R. Kwok. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2015.2425407 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016 951 1045-9219 ß 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
  • 2. documents can be divided into several categories. In other words, the points whose distance are short in the high dimensional space can be classified into a specific category. The search time can be largely reduced by selecting the desired category and abandoning the irrelevant categories. Comparing with all the documents in the dataset, the num- ber of documents which user aims at is very small. Due to the small number of the desired documents, a specific cate- gory can be further divided into several sub-categories. Instead of using the traditional sequence search method, a backtracking algorithm is produced to search the target documents. Cloud server will first search the categories and get the minimum desired sub-category. Then the cloud server will select the desired k documents from the mini- mum desired sub-category. The value of k is previously decided by the user and sent to the cloud server. If current sub-category can not satisfy the k documents, cloud server will trace back to its parent and select the desired documents from its brother categories. This process will be executed recursively until the desired k documents are satisfied or the root is reached. To verify the integrity of the search result, a verifiable structure based on hash function is constructed. Every document will be hashed and the hash result will be used to represent the document. The hashed results of docu- ments will be hashed again with the category information that these documents belong to and the result will be used to represent the current category. Similarly, every category will be represented by the hash result of the combination of current category information and sub-categories informa- tion. A virtual root is constructed to represent all the data and categories. The virtual root is denoted by the hash result of the concatenation of all the categories located in the first level. The virtual root will be signed so that it is verifiable. To verify the search result, user only needs to verify the vir- tual root, instead of verifying every document. 2 EXISTING SOLUTIONS In recent years, searchable encryption which provides text search function based on encrypted data has been widely studied, especially in security definition, formalizations and efficiency improvement, e.g. [2], [3], [4], [5], [6], [7]. As shown in Fig. 1, the proposed method is compared with existing solutions and has the advantage in main- taining the relationship between documents. 2.1 Single Keyword Searchable Encryption Song et al. [2] first introduced the notion of searchable encryption. They propose to encrypt each word in the docu- ment independently. This method has a high searching cost due to the scanning of the whole data collection word by word. Goh [8] formally defined a secure index structure and formulate a security model for index known as semantic security against adaptive chosen keyword attack (ind-cka). They also developed an efficient ind-cka secure index con- struction called z-idx by using pseudo-random functions and bloom filters. Cash et al. [41] recently design and imple- ment an efficient data structure. Due to the lack of rank mechanism, users have to take a long time to select what they want when massive documents contain the query key- word. Thus, the order-preserving techniques are utilized to realize the rank mechanism, e.g. [9], [10], [11]. Wang et al. [12] use encrypted invert index to achieve secure ranked keyword search over the encrypted documents. 
In the search phase, the cloud server computes the relevance score between documents and the query. In this way, relevant documents are ranked according to their relevance score and users can get the top-k results. In the public key setting, Boneh et al. [3] designed the first searchable encryption con- struction, where anyone can use public key to write to the data stored on server but only authorized users owning pri- vate key can search. However, all the above mentioned tech- niques only support single keyword search. 2.2 Multiple Keyword Searchable Encryption To enrich search predicates, a variety of conjunctive key- word search methods (e.g. [7], [13], [14], [15], [16]) have been proposed. These methods show large overhead, such as communication cost by sharing secret, e.g. [14], or computa- tional cost by bilinear map, e.g.[7]. Pang et al. [17] propose a secure search scheme based on vector space model. Due to the lack of the security analysis for frequency information and practical search performance, it is unclear whether their scheme is secure and efficient or not. Cao et al. [18] present a novel architecture to solve the problem of multi-keyword ranked search over encrypted cloud data. But the search time of this method grows exponentially accompanying with the exponentially increasing size of the document collections. Sun et al. [19] give a new architecture which achieves better search efficiency. However, at the stage of index building process, the relevance between documents is ignored. As a result, the relevance of plaintexts is concealed by the encryption, users expectation cannot be fulfilled well. For example: given a query containing Mobile and Phone, only the documents containing both of the keywords will be retrieved by traditional methods. But if taking the semantic relationship between the documents into consideration, the documents containing Cell and Phone should also be retrieved. Obviously, the second result is better at meeting the users expectation. 2.3 Verifiable Search Based on Authenticated Index The idea of data verification has been well studied in the area of databases. In a plaintext database scenario, a variety of methods have been produced, e.g. [20], [21], [22]. Most of these works are based on the original work by Merkle [23], [24] and refinements by Naor and Nissm [25] for certificate revocation. Merkle hash tree and cryptographic signature techniques are used to construct authenticated tree struc- ture upon which end users can verify the correctness and completeness of the query results. Fig. 1. Architecture of ciphertext search. 952 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016
  • 3. Pang and Mouratidis [26] apply the Merkle hash tree based on authenticated structure to text search engines. However, they only focus on the verification-specific issues ignoring the search privacy preserving capabilities that will be addressed in this paper. The hash chain is used to construct a single keyword search result verification scheme by Wang et al. [9]. Sun et al. [19] use Merkle hash tree and cryptographic signature to create a verifiable MDB-tree. However, their work cannot be directly used in our architecture which is oriented for pri- vacy-preserving multiple keyword search. Thus, a proper mechanism that can be used to verify the search results within big data scenario is essential to both the CSPs and end users. 3 OUR CONTRIBUTION In this paper, we propose a multi-keyword ranked search over encrypted data based on hierarchical clustering index (MRSE-HCI) to maintain the close relationship between dif- ferent plain documents over the encrypted domain in order to enhance the search efficiency. In the proposed architec- ture, the search time has a linear growth accompanying with an exponential growing size of data collection. We derive this idea from the observation that users retrieval needs usu- ally concentrate on a specific field. So we can speed up the searching process by computing relevance score between the query and documents which belong to the same specific field with the query. As a result, only documents which are classified to the field specified by users query will be evalu- ated to get their relevance score. Due to the irrelevant fields ignored, the search speed is enhanced. We investigate the problem of maintaining the close rela- tionship between different plain documents over an encrypted domain and propose a clustering method to solve this problem. According to the proposed clustering method, every document will be dynamically classified into a specific cluster which has a constraint on the minimum relevance score between different documents in the dataset. The relevance score is a metric used to evaluate the relationship between different documents. Due to the new documents added to a cluster, the constraint on the cluster may be bro- ken. If one of the new documents breaks the constraint, a new cluster center will be added and the current document will be chosen as a temporal cluster center. Then all the docu- ments will be reassigned and all the cluster centers will be reelected. Therefore, the number of clusters depends on the number of documents in the dataset and the close relation- ship between different plain documents. In other words, the cluster centers are created dynamically and the number of clusters is decided by the property of the dataset. We propose a hierarchical method in order to get a better clustering result within a large amount of data collection. The size of each cluster is controlled as a trade-off between clustering accuracy and query efficiency. According to the proposed method, the number of clusters and the minimum relevance score increase with the increase of the levels whereas the maximum size of a cluster reduces. Depending on the needs of the grain level, the maximum size of a cluster is set at each level. Every cluster needs to satisfy the con- straints. If there is a cluster whose size exceeds the limitation, this cluster will be divided into several sub-clusters. We design a search strategy to improve the rank pri- vacy. 
In the search phase, the cloud server will first com- pute the relevance score between query and cluster centers of the first level and then chooses the nearest clus- ter. This process will be iterated to get the nearest child cluster until the smallest cluster has been found. The cloud server computes the relevance score between query and documents included in the smallest cluster. If the smallest cluster can not satisfy the number of desired documents which is previously decided by user, cloud server will trace back to the parent cluster of the smallest cluster and the brother clusters of the smallest cluster will be searched. This process will be iterated until the num- ber of desired documents is satisfied or the root is reached. Due to the special search procedures, the rank- ings of documents among their search results are differ- ent with the rankings derived from traditional sequence search. Therefore, the rank privacy is enhanced. Some part of the above work has been presented in [27]. For further improvement, we also construct a verifi- able tree structure upon the hierarchical clustering method to verify the integrity of the search result in this paper. This authenticated tree structure mainly takes the advantage of the Merkle hash tree and cryptographic sig- nature. Every document will be hashed and the hash result will be used as the representative of the document. The smallest cluster will be represented by the hash result of the combination of the concatenation of the documents included in the smallest cluster and own category infor- mation. The parent cluster is represented by the hash result of the combination of the concatenation of its chil- dren and own category information. A virtual root is added and represented by the hash result of the concate- nation of the categories located in the first level. In addi- tion, the virtual root will be signed so that user can achieve the goal of verifying the search result by verifying the virtual root. In short, our contributions can be summarized as follows: 1) We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve this problem. 2) We proposed the MRSE-HCI architecture to speed up server-side searching phase. Accompanying with the exponential growth of document collection, the search time is reduced to a linear time instead of exponential time. 3) We design a search strategy to improve the rank pri- vacy. This search strategy adopts the backtracking algorithm upon the above clustering method. With the growing of the data volume, the advantage of the proposed method in rank privacy tends to be more apparent. 4) By applying the Merkle hash tree and cryptographic signature to authenticated tree structure, we provide a verification mechanism to assure the correctness and completeness of search results. The organization of the following parts of the paper is as follows: Section 4 describes the system model, threat model, design goals and notations. The architecture and CHEN ET AL.: AN EFFICIENT PRIVACY-PRESERVING RANKED KEYWORD SEARCH METHOD 953
  • 4. detailed algorithm are displayed in Section 5. We discuss the efficiency and security of MRSE-HCI scheme in Section 6. An evaluation method is provided in Section 7. Section 8 demonstrates the result of our experiments. Section 9 concludes the paper. 4 DEFINITION AND BACKGROUND 4.1 System Model The system model contains three entities, as illustrated in Fig. 1, the data owner, the data user, and the cloud server. The box with dashed lines in the figure indicates the added component to the existing architecture. The data owner is responsible for collecting documents, building document index and outsourcing them in an encrypted format to the cloud server. Apart from that, the data user needs to get the authorization from the data owner before accessing to the data. The cloud server provides a huge storage space, and the computation resources needed by ciphertext search. Upon receiving a legal request from the data user, the cloud server searches the encrypted index, and sends back top-k documents that are most likely to match users query [11]. The number k is properly chosen by the data user. Our system aims at protecting data from leaking information to the cloud server while improving the effi- ciency of ciphertext search. In this model, both the data owner and the data user are trusted, while the cloud server is semi-trusted, which is con- sistent with the architecture in [9], [18], [28]. In other words, the cloud server will strictly follow the predicated order and try to get more information about the data and the index. 4.2 Threat Model The adversarys ability can be concluded in two threat models. Known ciphertext model. In this model, Cloud server can get encrypted document collection, encrypted data index, and encrypted query keywords. Known background model. In this model, cloud server knows more information than that in known ciphertext model. Statistical background information of dataset, such as the document frequency and term frequency information of a specific keyword, can be used by the cloud server to launch a statistical attack to infer or identify specific key- word in the query [9], [10], which further reveals the plain- text content of documents. The adversarys ability can be represented in the above two threat models. 4.3 Design Goals Search efficiency. The time complexity of search time of the MRSE-HCI scheme needs to be logarithmic against the size of data collection in order to deal with the explosive growth of document size in big data scenario. Retrieval accuracy. Retrieval precision is related to two factors: the relevance between the query and the documents in result set, and the relevance of docu- ments in the result set. Integrity of the search result. The integrity of the search results includes three aspects: 1) Correctness. All the documents returned from servers are originally uploaded by the data owner and remain unmodified. 2) Completeness. No qualified documents are omit- ted from the search results. 3) Freshness. The returned documents are the latest version of documents in the dataset. Privacy requirements. We set a series of privacy requirements which current researchers mostly focus on. 1) Data privacy. Data privacy presents the confi- dentiality and privacy of documents. The adver- sary cannot get the plaintext of documents stored on the cloud server if data privacy is guaranteed. Symmetric cryptography is a con- ventional way to achieve data privacy. 2) Index privacy. 
Index privacy means the ability to frustrate the adversary attempt to steal the infor- mation stored in the index. Such information includes keywords and the TF (Term Frequency) of keywords in documents, the topic of docu- ments, and so on. 3) Keyword privacy. It is important to protect users query keywords. Secure query generation algo- rithm should output trapdoors which leak no information about the query keywords. 4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from the query is different, even for the same query. It can be realized by integrating a random function in the trapdoor generation process. If the adversary can deduce the certain set of trapdoors which all corresponds to the same keyword, he can calculate the frequency of this keyword in search request in a certain period. Combined with the document frequency of keyword in known background model, he/she can use statistical attack to identify the plain keyword behind these trapdoors. 5) Rank privacy. Rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank order of different search results, further identify the search keyword. 4.4 Notations In this paper, notations presented in Table 1 are used. 5 ARCHITECTURE AND ALGORITHM 5.1 System Model In this section, we will introduce the MRSE-HCI scheme. The vector space model adopted by the MRSE-HCI scheme is same as the MRSE [18], while the process of building index is totally different. The hierarchical index structure is introduced into the MRSE-HCI instead of sequence index. In MRSE-HCI, every document is indexed by a vector. Every dimension of the vector stands for a keyword and the value represents whether the keyword appears or not in the document. Similarly, the query is also represented by a vec- tor. In the search phase, cloud server calculates the rele- vance score between the query and documents by 954 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016
  • 5. computing the inner product of the query vector and docu- ment vectors and return the target documents to user according to the top k relevance score. Due to the fact that all the documents outsourced to the cloud server is encrypted, the semantic relationship between plain documents over the encrypted documents is lost. In order to maintain the semantic relationship between plain documents over the encrypted documents, a cluster- ing method is used to cluster the documents by clustering their related index vectors. Every document vector is viewed as a point in the n-dimensional space. With the length of vectors being normalized, we know that the dis- tance of points in the n-dimensional space reflect the rele- vance of corresponding documents. In other word, points of high relevant documents are very close to each other in the n-dimensional space. As a result, we can cluster the docu- ments based on the distance measure. With the volume of data in the data center has experienced a dramatic growth, conventional sequence search approach will be very inefficient. To promote the search efficiency, a hierarchical clustering method is proposed. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold at different levels, and then partitions the resulting clusters into sub-clusters until the con- straint on the maximum size of cluster is reached. Upon receiving a legal request, cloud server will search the related indexes layer by layer instead of scanning all indexes. 5.2 MRSE-HCI Architecture MRSE-HCI architecture is depicted by Fig. 2, where the data owner builds the encrypted index depending on the dictionary, random numbers and secret key, the data user submits a query to the cloud server for getting desired documents, and the cloud server returns the target docu- ments to the data user. This architecture mainly consists of following algorithms. Keygenð1lðnÞ Þ ! ðsk; kÞ. It is used to generate the secret key to encrypt index and documents. IndexðD; skÞ ! I. Encrypted index is generated in this phase by using the above mentioned secret key. At the same time, clustering process is also included current phase. EncðD; kÞ ! E. The document collection is encrypted by a symmetric encryption algorithm which achieves semantic security. Trapdoorðw; skÞ ! Tw. It generates encrypted query vector Tw with users input keywords and secret key. SearchðTw; I; ktopÞ ! ðIw; EwÞ. In this phase, cloud server compares trapdoor with index to get the top-k retrieval results. DecðEw; kÞ ! Fw. The returned encrypted documents are decrypted by the key generated in the first step. The concrete functions of different components is described as below. 1) Keygenð1lðnÞ Þ. The data owner randomly generates a ðn þ u þ 1Þ bit vector S where every element is a integer 1 or 0 and two invertible ðn þ u þ 1ÞÂ ðn þ u þ 1Þmatrices whose elements are random inte- gers as secret key sk. The secret key k is generated by the data owner choosing an n-bit pseudo sequence. 2) IndexðD; skÞ. As show in the Fig. 3, the data owner uses tokenizer and parser to analyze every docu- ment and gets all keywords. Then data owner uses the dictionary Dw to transform documents to a collection of document vectors DV . Then the data owner calculates the DC and CCV by using a qual- ity hierarchical clustering (QHC) method which will be illustrated in section C. After that, the data owner applies the dimension-expanding and TABEL 1 Notations di The ith document vector, denoted as di ¼ fdi;1; . . . 
; di;ng, where di;j represents whether the jth keyword in the dictionary appears in document di. m The number of documents in the data collection. n The size of dictionary DW . CCV The collection of cluster centers vectors, denoted as CCV ¼ fc1; . . . ; cng, where ci is the average vector of all document vectors in the cluster. CCVi The collection of the ith level cluster center vectors, denoted as CCVi ¼ fvi;1; . . . ; vi;ng where Vi;j represents the jth vector in the ith level. DC The information of documents classification such as document id list of a certain cluster. DV The collection of document vectors, denoted as DV ¼ fd1; d2; . . . ; dmg. DW The dictionary, denoted as Dw ¼ fw1; w2; . . . ; wng. Fw The ranked id list of all documents according to their relevance to keyword w. Ic The clustering index which contains the encrypted vectors of cluster centers. Id The traditional index which contains encrypted document vectors. Li The minimum relevance score between different documents in the ith level of a cluster. QV The query vector. TH A fixed maximum number of documents in a cluster. Tw The encrypted query vector for users query. Fig. 2. MRSE-HCI architecture. Fig. 3. Algorithm index. CHEN ET AL.: AN EFFICIENT PRIVACY-PRESERVING RANKED KEYWORD SEARCH METHOD 955
  • 6. vector-splitting procedure to every document vec- tor. It is worth noting that CCV is treated equally as DV . For dimension-expanding, every vector in DV is extended to ðn þ u þ 1Þ bit-long, where the value in nþ jð0 j uÞ dimension is an integer number generated randomly and the last dimen- sion is set to 1. For vector-splitting, every extended document vector is split into two ðn þ u þ 1Þ bit- long vectors, V 0 and V 00 with the help of the ðn þ u þ 1Þbit vector S as a splitting indicator. If the ith element of S (Si ) is 0, then we set V 00 i ¼ V 0 i ¼ Vi ; If ith element of S (Si ) is 1, then V 00 i is set to a random number and V 0 i ¼ Vi À V 00 i . Finally, the traditional index Id is encrypted as Id ¼ fMT 1 V 0 ; MT 2 V 00 gby using matrix multiplication with the sk, and Ic is generated in a similar way. After this, Id ,Ic , and DC are outsourced to the cloud server. 3) EncðD; kÞ. The data owner adopts a secure symmet- ric encryption algorithm (e.g. AES) to encrypt the plain document set D and outsources it to the cloud server. 4) Trapdoorðw; skÞ. The data user sends the query to the data owner who will later analyze the query and builds the query vector QV by analyzing the keywords of query with the help of dictionary DW , QV then is extended to a ðn þ u þ 1Þ bit query vector. Subsequently,v random positions chosen from a range ðn; n þ uŠ in QV are set to 1, others are set to 0.The value at last dimension of QV is set to a random number t½0; 1Š. Then the first ðn þ uÞdimensions of QW , denoted as qw, is scaled by a random number rðr 6¼ 0Þ ,Qw ¼ ðr Á qw; tÞ . After that, Qw is split into two random vectors as fQ0 W ; Q00 W g with vector-splitting procedure which is similar to that in the IndexðD; skÞ phase. The dif- ference is that if the ith bit of S is 1, then we have q0 i ¼ q00 i ¼ qi; If the ith bit of S is 0, q0 i is set as a random number and q00 i ¼ qi À q0 i. Finally, the encrypted query vector Tw is generated as Tw ¼ fMÀ1 1 Q0 w; MÀ1 2 Q00 wg and sent back to the data user. 5) SearchðTw; I; ktopÞ. Upon receiving the Tw from data user, the cloud server computes the relevance score between Tw and index Ic and then chooses the matched cluster which has the highest rele- vance score. For every document contained in the matched cluster, the cloud server extract its corre- sponding encrypted document vector in Id , and calculates its relevance score S with Tw , as described in the Equation (1). Finally, these scores of documents in the matched cluster are sorted and the top ktop documents are returned by the cloud server. The detail will be discussed in the Section 5.5. S ¼ Tw Á Ic ¼ fMÀ1 1 Q0 w; MÀ1 2 Q00 wg Á fMT 1 V 0 ; MT 2 V 00 g ¼ Q0 w Á V 0 þ Q0 w Á V 00 ¼ Qw Á V: (1) 6) DecðEw; kÞ. The data user utilizes the secret key k to decrypt the returned ciphertext Ew. 5.3 Relevance Measure In this paper, the concept of coordinate matching [29] is adopted as a relevance measure. It is used to quantify the relevance of document-query and document-document. It is also used to quantify the relevance of the query and cluster centers. Equation (2) defines the relevance score between document di and query qw . Equation (3) defines the relevance score between query qw and cluster center lci;j . Equation (4) defines the relevance score between document di and dj. Sqdi ¼ Xnþuþ1 t¼1 ðqw;t  di;tÞ (2) Sqci ¼ Xnþuþ1 t¼1 ðqw;t  lci;j;tÞ (3) Sddi ¼ Xnþuþ1 t¼1 ðdi;t  dj;tÞ: (4) 5.4 Quality Hierarchical Clustering Algorithm So far, a lot of hierarchical clustering methods has been proposed. 
However all of these methods are not compara- ble to the partition clustering method in terms of time complexity performance. K-means [30] and K-medois [31] are popular partition clustering algorithms. But the k is fixed in the above two methods, which can not be applied to the situation of dynamic number of cluster centers. We propose a quality hierarchical clustering algorithm based on the novel dynamic K-means. As the proposed dynamic K-means algorithm shown in the Fig. 4, the minimum relevance threshold of the clus- ters is defined to keep the cluster compact and dense. If the relevance score between a document and its center is smaller than the threshold, a new cluster center is added and all the documents are reassigned. The above procedure will be iterated until k is stable. Comparing with the traditional clustering method, k is dynamically changed during the clustering process. This is why it is called dynamic K-means algorithm . The QHC algorithm is illustrated in the Fig. 5. It goes like that. Every cluster will be checked on whether its Fig. 4. Algorithm dynamic k-means. Fig. 5. Algorithm quality hierarchical clustering (QHC). 956 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016
The QHC algorithm is illustrated in Fig. 5. It works as follows: every cluster is checked to see whether its size exceeds the maximum cluster size TH. If it does, the oversized cluster is split into child clusters formed by running dynamic K-means on its documents. This procedure is iterated until all clusters satisfy the maximum-cluster-size constraint.

Fig. 5. Algorithm quality hierarchical clustering (QHC).

The clustering procedure is illustrated in Fig. 6. All documents are represented as points in a coordinate system. The points are initially partitioned into two clusters by the dynamic K-means algorithm with k = 2; these two larger clusters are depicted by ellipses. The two clusters are then checked to see whether their points satisfy the distance constraint. The second cluster does not, so a new cluster center is added (k = 3) and the dynamic K-means algorithm runs again to partition the second cluster into two parts. The data owner then checks whether the size of each cluster exceeds the maximum number TH. Cluster 1 is split into two sub-clusters because of its size. Finally, all points are clustered into four clusters, depicted by rectangles.

Fig. 6. Clustering process.

5.5 Search Algorithm
The cloud server needs to find the cluster that best matches the query. With the help of the cluster index I_c and the document classification DC, the cloud server uses an iterative procedure to find the best-matched cluster:
1) The cloud server computes the relevance scores between the query T_w and the encrypted vectors of the first-level cluster centers in I_c, and chooses the cluster center I_{c,1,i} with the highest score.
2) The cloud server retrieves the child cluster centers of that cluster center, computes the relevance score between T_w and every encrypted child cluster center vector, and chooses the cluster center I_{c,2,i} with the highest score. This procedure is iterated until the final cluster center I_{c,l,j} at the last level l is reached.
In the situation depicted in Fig. 7, nine documents are grouped into three clusters. After the relevance scores with trapdoor T_w are calculated, cluster 1, shown within the dashed box in Fig. 7, is found to be the best match. Documents d_1, d_3, and d_9 belong to cluster 1, so their encrypted document vectors are extracted from I_d to compute their relevance scores with T_w.

Fig. 7. Retrieval process.
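The sketch below puts the two pieces just described together, again on plaintext vectors for readability: a qhc function that recursively splits any cluster larger than TH by reusing dynamic_kmeans from the previous sketch, and a search function that walks down the resulting tree by always following the child center with the highest score and then ranks only the documents of the matched leaf. The real scheme performs these score comparisons on the encrypted vectors of I_c and I_d; the tree layout and all names here are illustrative.

```python
# Illustrative sketch of QHC index building and the top-down search
# (plaintext vectors; reuses dynamic_kmeans from the previous sketch).
import numpy as np

def qhc(docs, ids, th_max, min_relevance, k_init=2):
    """Return a tree of clusters: a leaf {'center', 'ids'} or an internal
    node {'center', 'children'}; any cluster larger than th_max is split."""
    center = docs.mean(axis=0)
    if len(ids) <= th_max:
        return {'center': center, 'ids': list(ids)}
    assign, centers = dynamic_kmeans(docs, k_init, min_relevance)
    groups = [np.flatnonzero(assign == j) for j in range(len(centers))]
    groups = [g for g in groups if len(g) > 0]
    if len(groups) < 2:                        # degenerate split: stop here
        return {'center': center, 'ids': list(ids)}
    children = [qhc(docs[g], [ids[i] for i in g], th_max, min_relevance, k_init)
                for g in groups]
    return {'center': center, 'children': children}

def search(tree, query, k_top, doc_vectors):
    """Descend to the best-matching cluster, then rank only its documents."""
    node = tree
    while 'children' in node:
        node = max(node['children'], key=lambda c: query @ c['center'])
    ranked = sorted(node['ids'], key=lambda i: query @ doc_vectors[i], reverse=True)
    return ranked[:k_top]
```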
5.6 Search Result Verification
The retrieved data have a high probability of being wrong, since the network is unstable and the data may be damaged by hardware or software failures or by a malicious administrator or intruder. Verifying the authenticity of search results is therefore emerging as a critical issue in the cloud environment. We design a signed hash tree to verify the correctness and freshness of the search results.

Building. The data owner builds the hash tree based on the hierarchical index structure; the algorithm is shown in Fig. 8. The hash value of a leaf node is h(id ∥ version ∥ F(id)), where id is the document id, version is the document version, and F(id) is the document content. The value of a non-leaf node is a pair (id, h(id ∥ h_child)), where id denotes the value of the cluster center or document vector in the encrypted index and h_child is the hash value of its child nodes. The hash value of the root node is computed over the hash values of all first-level clusters; the root node represents the data set containing all clusters. The data owner then generates a signature over the hash value of the root node and outsources the hash tree, including the root signature, to the cloud server. A cryptographic signature scheme (e.g., an RSA or DSA signature) can be used here to authenticate the root hash.

Fig. 8. Algorithm building-minimum hash sub-tree.

Processing. Following the algorithm shown in Fig. 9, the cloud server returns the root signature and the minimum hash sub-tree (MHST) to the client. The minimum hash sub-tree includes the hash values of the leaf nodes in the matched cluster and of the non-leaf nodes corresponding to all cluster centers used to find the matched cluster during the search phase. For example, in Fig. 10 the search result consists of documents D, E, and F. The leaf nodes are then D, E, F, and G, and the non-leaf nodes include C1, C2, C3, C4, d_D, d_E, d_F, and d_G. In addition, the root is included among the non-leaf nodes.

Fig. 9. Algorithm processing-minimum hash sub-tree.

Verifying. The data user uses the minimum hash sub-tree to re-compute the hash values of the nodes, in particular the root node, which is further verified against the root signature. If all nodes match, correctness and freshness are guaranteed. The data user then re-searches the index constructed from the values retrieved in the MHST. If this search result is the same as the retrieved result, then completeness, correctness, and freshness are all guaranteed.
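A minimal sketch of the two sides of this protocol is given below, assuming SHA-256, a dict-based tree, and a fixed child order when hashes are concatenated. The virtual root and the RSA/DSA signature on the root hash are abstracted away (the sketch compares against a root hash that is assumed to have been authenticated already), and all names are illustrative.

```python
# Illustrative sketch of building the hash tree and verifying a minimum
# hash sub-tree (MHST); the signature on the root hash is omitted.
import hashlib

def h(*parts):
    m = hashlib.sha256()
    for p in parts:
        m.update(str(p).encode("utf-8"))
        m.update(b"|")                       # simple separator
    return m.hexdigest()

def leaf_hash(doc_id, version, content):
    return h(doc_id, version, content)       # h(id || version || F(id))

def build_tree(node):
    """node is {'id', 'docs': [(id, version, content), ...]} for a leaf
    cluster or {'id', 'children': [...]} for an internal cluster; the node
    is annotated in place with its hash h(id || child hashes)."""
    if 'docs' in node:
        child_hashes = [leaf_hash(*d) for d in node['docs']]
    else:
        child_hashes = [build_tree(c)['hash'] for c in node['children']]
    node['hash'] = h(node['id'], *child_hashes)
    return node

def verify_mhst(cluster_leaf_hashes, path, root_hash):
    """cluster_leaf_hashes: ordered leaf hashes of the matched cluster, with
    the hashes of the returned documents recomputed locally and the rest
    taken from the MHST.
    path: [(node_id, left_sibling_hashes, right_sibling_hashes), ...] from
    the matched cluster up to the root.
    Returns True when the recomputed root equals the authenticated root hash."""
    child_hashes = list(cluster_leaf_hashes)
    current = None
    for node_id, left, right in path:
        current = h(node_id, *(list(left) + child_hashes + list(right)))
        child_hashes = [current]
    return current == root_hash
```

For the example of Fig. 10 below, the data user would recompute the leaf hashes of D, E, and F, take the hash of G from the MHST, and call verify_mhst with the path [('C4', [], []), ('C1', [h_C3], []), ('root', [], [h_C2])].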
As shown in Fig. 10, in the building phase all documents are clustered into two big clusters and four small clusters, and each big cluster contains two small clusters. The hash value of leaf node A is h(id_A ∥ version ∥ F(id_A)), the value of non-leaf node C3 is (id_C3, h(id_C3 ∥ h_A ∥ h_B ∥ h_C)), and the value of non-leaf node C1 is (id_C1, h(id_C1 ∥ h_C3 ∥ h_C4)). The values of the other leaf and non-leaf nodes are generated similarly. To combine all first-level clusters into a tree, the data owner creates a virtual root node with hash value h(h_{C1,2} ∥ h_{C2,2}), where C1,2 and C2,2 denote the second part of cluster centers 1 and 2, respectively. The data owner then signs the root node, e.g., σ(h(h_{C1,2} ∥ h_{C2,2})), and outsources it to the cloud server.

In the processing phase, suppose that cluster C4 is the matched cluster and that the returned top-three documents are D, E, and F. The minimum hash sub-tree then includes the hash values of the nodes D, E, F, d_D, d_E, d_F, d_G, C3, C2, C1, C4 and the signed root σ(h(h_{C1,2} ∥ h_{C2,2})).

In the verifying phase, upon receiving the signed root, the data user first checks whether the root signature verifies against h(h_{C1,2} ∥ h_{C2,2}). If it does not, the retrieved hash tree is not authentic; otherwise the returned nodes D, E, F, d_D, d_E, d_F, d_G, C3, C2, C1, and C4 are used to verify each other and to reconstruct the hash tree. If all nodes are authentic, the returned hash tree is authentic. The data user then re-computes the hash values of the leaf nodes D, E, and F from the returned documents and compares these newly generated hash values with the corresponding returned hash values. If there is no difference, the retrieved documents are correct. Finally, the data user uses the trapdoor to re-search the index constructed from the first part of the retrieved nodes. If this search result is the same as the retrieved result, the search result is complete.

Fig. 10. Authentication for hierarchical clustering index.

5.7 Dynamic Data Collection
Since documents stored at the server may be deleted or modified and new documents may be added to the original data collection, a mechanism that supports dynamic data collections is necessary. A naive way to address these problems is to download all documents and the index locally and then update the data collection and index; however, this approach incurs a huge cost in bandwidth and local storage space.

To avoid updating the index frequently, we provide a practical strategy for handling insertion, deletion, and modification operations. Without loss of generality, we use the following examples to illustrate how the strategy works. The data owner reserves many empty entries in the dictionary for new documents. If a new document contains new keywords, the data owner first adds these keywords to the dictionary and then constructs a document vector based on the new dictionary. The data owner sends the trapdoor generated from the document vector, the encrypted document, and the encrypted document vector to the cloud server. The cloud server finds the closest cluster and puts the encrypted document and encrypted document vector into it.
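For illustration, the insertion step might look as follows on the plaintext tree produced by the qhc sketch above; dictionary padding, trapdoor generation, and encryption are omitted, and the names are hypothetical.

```python
# Illustrative sketch of inserting a new document into the cluster tree
# (plaintext view; the cloud server would work on encrypted vectors).
def insert_document(tree, doc_id, doc_vector, doc_vectors, th_max):
    """Descend to the closest leaf cluster, add the document, and report
    whether the leaf now exceeds th_max so the owner can re-cluster it."""
    doc_vectors[doc_id] = doc_vector          # doc_vector is a NumPy array
    node = tree
    while 'children' in node:
        node = max(node['children'], key=lambda c: doc_vector @ c['center'])
    node['ids'].append(doc_id)
    return node, len(node['ids']) > th_max
```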
Since every cluster has a constraint on its maximum size, it is possible that the number of documents in a cluster exceeds the limit after an insertion. In this case, all the encrypted document vectors belonging to the overfull cluster are returned to the data owner. After decrypting the retrieved document vectors, the data owner rebuilds the sub-index based on the decrypted vectors. The sub-index is then re-encrypted and re-outsourced to the cloud server.

Upon receiving a deletion order, the cloud server searches for the target document and then deletes the document and the corresponding document vector. Modifying a document can be described as deleting the old version of the document and inserting the new version; the modification operation can therefore be realized by combining the insertion and deletion operations.

To deal with the impact of updates on the hash tree, a lazy update strategy is designed. For an insertion, the corresponding hash value is calculated and marked as a raw node, while the original nodes in the hash tree are kept unchanged, because the original hash tree still supports verification for every document except the new one. Only when the newly added document is accessed is the hash tree updated. A similar idea is used for deletion; the only difference is that a deletion does not trigger a hash-tree update.

6 EFFICIENCY AND SECURITY
6.1 Search Efficiency
The search process can be divided into the Trapdoor(w, sk) phase and the Search(T_w, I, k_top) phase. The number of operations needed in the Trapdoor(w, sk) phase is given in Equation (5), where n is the number of keywords in the dictionary and w is the number of query keywords:

O(MRSE-HCI) = 5n + u − v − w + 5.   (5)

Because the time complexity of the Trapdoor(w, sk) phase is independent of DC, it can be described as O(1) even when DC grows exponentially.

The difference in the search process between MRSE-HCI and MRSE lies in the retrieval algorithm used in this phase. In the Search(T_w, I, k_top) phase of MRSE, the cloud server needs to compute the relevance score between the encrypted query vector T_w and all encrypted document vectors in I_d to obtain the top-k ranked document list F_w. The number of operations needed in the Search(T_w, I, k_top) phase is given in Equation (6), where m is the number of documents in DC and n is the number of keywords in the dictionary:
O(MRSE) = 2m(2n + 2u + 1) + m − 1.   (6)

In the Search(T_w, I, k_top) phase of MRSE-HCI, however, the cloud server uses the information in DC to quickly locate the matched cluster and compares T_w with only a limited number of encrypted document vectors in I_d. The number of operations needed in the Search(T_w, I, k_top) phase is given in Equation (7), where k_i is the number of cluster centers that must be compared at the ith level and c is the number of document vectors in the matched cluster:

O(MRSE-HCI) = (Σ_{i=1}^{l} k_i) · 2(2n + 2u + 1) + c · 2(2n + 2u + 1) + c − 1.   (7)

When DC grows exponentially, m can be set to 2^l. The time complexity of the traditional MRSE is then O(2^l), while the time complexity of the proposed MRSE-HCI is only O(l). The total search time can be calculated as in Equation (8), where O(trapdoor) is O(1) and O(query) depends on DC:

O(searchTime) = O(trapdoor) + O(query).   (8)

In short, when the number of documents in DC grows exponentially, the search time of MRSE-HCI increases linearly, whereas that of the traditional methods increases exponentially.

6.2 Security Analysis
To keep the security analysis brief, we adopt some concepts from [37], [38], [39] and define what kinds of information are leaked to the honest-but-curious server. The basic information about documents and queries is inevitably leaked to the honest-but-curious server, since all data are stored at the server and all queries are submitted to it. Moreover, the access pattern and search pattern are not protected in MRSE-HCI, as in previous searchable encryption schemes [18], [38], [39], [40].

Definition 1 (Size pattern). Let D be a document collection. The size pattern induced by a q-query is a tuple α(D, Q) = (m, |Q_1|, ..., |Q_q|), where m is the number of documents and |Q_i| is the size of query Q_i.

Definition 2 (Access pattern). Let D be a document collection and I an index over D. The access pattern induced by a q-query is a tuple β(D, Q) = (I(Q_1), ..., I(Q_q)), where I(Q_i) is the set of identifiers returned by query Q_i, for 1 ≤ i ≤ q.

Definition 3 (Search pattern). Let D be a document collection. The search pattern induced by a q-query is an m × q binary matrix γ(D, Q) such that, for 1 ≤ i ≤ m and 1 ≤ j ≤ q, the element in the ith row and jth column is 1 if document identifier id_i is returned by query Q_j.

Definition 4 (Known ciphertext model security). Let Π = (Keygen, Index, Enc, Trapdoor, Search, Dec) be an index-based MRSE-HCI scheme over dictionary D_w and let n ∈ N be the security parameter. The known ciphertext model experiment PrivK^{kcm}_{A,Π}(n) is defined as follows.
1) The adversary submits two document collections D_0 and D_1 of the same length to a challenger.
2) The challenger generates a secret key {sk, k} by running Keygen(1^{l(n)}).
3) The challenger randomly chooses a bit b ∈ {0, 1} and returns Index(D_b, sk_b) → I_b and Enc(D_b, k_b) → E_b to the adversary.
4) The adversary outputs a bit b'.
5) The output of the experiment is defined to be 1 if b' = b, and 0 otherwise.

We say that the MRSE-HCI scheme is secure under the known ciphertext model if for all probabilistic polynomial-time adversaries A there exists a negligible function negl(n) such that

Pr(PrivK^{kcm}_{A,Π}(n) = 1) ≤ 1/2 + negl(n).   (9)

Proof. The adversary A distinguishes the document collections by analyzing the secret key, the index, and the encrypted document collection.
Then we have Equation (10), where Adv(A_D(sk, k)) is the advantage of adversary A in distinguishing the secret key from two random matrices and two random strings, Adv(A_D(I)) is the advantage in distinguishing the index from a random string, and Adv(A_D(E)) is the advantage in distinguishing the encrypted documents from random strings:

Pr(PrivK^{kcm}_{A,Π}(n) = 1) = 1/2 + Adv(A_D(sk, k)) + Adv(A_D(I)) + Adv(A_D(E)).   (10)

The elements of the two matrices in the secret key are chosen uniformly at random from {0, 1}^{l(n)}, and the split indicator S and the key k are also chosen uniformly at random from {0, 1}^{l(n)}. Given {0, 1}^{l(n)}, A distinguishes the secret key from two random matrices and two random strings with only negligible probability. Hence there exists a negligible function negl_1(n) such that

Adv(A_D(sk, k)) = |Pr(Keygen(1^{l(n)}) → (sk, k)) − Pr(Random → (sk_r, k_r))| ≤ negl_1(n),   (11)

where sk_r denotes two random matrices and a random string, and k_r is a random string. In our scheme, encrypting the hierarchical index essentially amounts to encrypting all document vectors and cluster center vectors; all cluster center vectors are treated as document vectors in the encryption phase. Eventually, all document vectors and cluster center vectors are encrypted with the secure kNN scheme. Since secure kNN is secure against known plaintext attack (KPA) [32], the hierarchical index is secure under the known ciphertext model. Hence there exists a negligible function negl_2(n) such that

Adv(A_D(I)) = |Pr(Index(D, sk) → I) − Pr(Random → I_r)| ≤ negl_2(n),   (12)

where I_r is a random string. Since the encryption algorithm used to encrypt D_b is semantically secure, the encrypted documents are secure under the known ciphertext model.
Hence there exists a negligible function negl_3(n) such that

Adv(A_D(E)) = |Pr(Enc(D, k) → E) − Pr(Random → E_r)| ≤ negl_3(n),   (13)

where E_r is a set of random strings. From Equations (10), (11), (12), and (13) we obtain Equation (14); defining negl(n) as in Equation (15) and combining Equations (14) and (15) yields Equation (16):

Pr(PrivK^{kcm}_{A,Π}(n) = 1) ≤ 1/2 + negl_1(n) + negl_2(n) + negl_3(n),   (14)

negl(n) = negl_1(n) + negl_2(n) + negl_3(n),   (15)

Pr(PrivK^{kcm}_{A,Π}(n) = 1) ≤ 1/2 + negl(n).   (16)

We therefore conclude that MRSE-HCI is secure under the known ciphertext model. □

7 EVALUATION METHOD
7.1 Search Precision
Search precision quantifies user satisfaction. Retrieval precision is related to two factors: the relevance between the documents and the query, and the relevance of the documents to each other. Equation (17) defines the relevance between the retrieved documents and the query:

P_q = Σ_{i=1}^{k'} S(q_w, d_i) / Σ_{i=1}^{k} S(q_w, d_i).   (17)

Here, k' denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, q_w is the query vector, d_i is a document vector, and S is the function that computes the relevance score between q_w and d_i. Equation (18) defines the relevance among the retrieved documents:

P_d = Σ_{j=1}^{k'} Σ_{i=1}^{k'} S(d_j, d_i) / Σ_{j=1}^{k} Σ_{i=1}^{k} S(d_j, d_i).   (18)

Here, k' and k are defined as above, and d_i and d_j are document vectors. Equation (19) combines the relevance between the query and the retrieved documents with the relevance among the documents to quantify the search precision:

Acc = a · P_q + P_d,   (19)

where a is a tradeoff parameter that balances the two relevance terms. If a is smaller than 1, more emphasis is placed on the relevance among documents; otherwise, more emphasis is placed on the query keywords. The above evaluation strategies should be based on the same data set and keywords.

7.2 Rank Privacy
Rank privacy quantifies the information leakage of the search results. The definition of rank privacy is adopted from [18]; Equation (20) is used to evaluate it:

P_k = Σ_{i=1}^{k} p_i / k.   (20)

Here, k denotes the number of top-k retrieved documents and p_i = |c'_i − c_i|, where c'_i is the rank of document d_i in the retrieved top-k documents, c_i is its actual rank in the data set, and p_i is set to k if it exceeds k. The overall rank privacy measure at point k, denoted P_k, is the average value of p_i over the documents d_i in the retrieved top-k list.
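The small sketch below computes these metrics directly from plaintext vectors and rank lists, following Equations (17)-(20); the function names are illustrative and the vectors are assumed to be NumPy arrays.

```python
# Illustrative sketch of the evaluation metrics in Equations (17)-(20)
# (retrieved/baseline are lists of NumPy document vectors).
def precision_q(query, retrieved, baseline):
    """Eq. (17): query-document relevance, normalised by plaintext search."""
    return sum(query @ d for d in retrieved) / sum(query @ d for d in baseline)

def precision_d(retrieved, baseline):
    """Eq. (18): pairwise document relevance, normalised by plaintext search."""
    num = sum(di @ dj for di in retrieved for dj in retrieved)
    den = sum(di @ dj for di in baseline for dj in baseline)
    return num / den

def accuracy(query, retrieved, baseline, a=1.0):
    """Eq. (19): Acc = a * P_q + P_d."""
    return a * precision_q(query, retrieved, baseline) + precision_d(retrieved, baseline)

def rank_privacy(retrieved_ranks, actual_ranks, k):
    """Eq. (20): mean of p_i = |c'_i - c_i|, with each p_i capped at k."""
    return sum(min(abs(r - c), k) for r, c in zip(retrieved_ranks, actual_ranks)) / k
```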
8 PERFORMANCE ANALYSIS
To test the performance of MRSE-HCI on a real data set, we built an experimental platform to evaluate search efficiency, accuracy, and rank privacy. The experiments were run on a distributed platform consisting of three ThinkServer RD830 machines and a ThinkCenter M8400t. The data set is built from IEEE Xplore and contains about 51,000 documents and 22,000 keywords. Following the notation defined in Section 4, n denotes the dictionary size, k the number of top-k documents, m the number of documents in the data set, and w the number of keywords in the user's query.

Fig. 11 shows search efficiency under different conditions. Fig. 11a shows search efficiency for different document-set sizes with the dictionary size, number of retrieved documents, and number of query keywords fixed (n = 22,157, k = 20, w = 5). In Fig. 11b, we vary k with the dictionary size, document-set size, and number of query keywords fixed (n = 22,157, m = 51,312, w = 5). Fig. 11c varies the number of query keywords with the dictionary size, document-set size, and number of retrieved documents fixed (n = 22,157, m = 51,312, k = 20).

From Fig. 11a, we observe that with exponential growth of the document-set size, the search time of MRSE increases exponentially, while the search time of MRSE-HCI increases linearly. As Figs. 11b and 11c show, the search time of MRSE-HCI remains stable as the number of query keywords and retrieved documents increases, and it stays far below that of MRSE.

Fig. 11. Search efficiency.
Fig. 12 shows search accuracy, using plaintext search as the baseline. Fig. 12a illustrates the relevance among retrieved documents. As the number of documents increases from 3,200 to 51,200, the MRSE-to-plaintext-search ratio fluctuates around 1, while the MRSE-HCI-to-plaintext-search ratio increases from 1.5 to 2. From Fig. 12a we observe that the relevance among documents retrieved by MRSE-HCI is almost twice that of MRSE, which means the documents retrieved by MRSE-HCI are much closer to each other. Fig. 12b shows the relevance between the query and the retrieved documents. As the document-set size increases from 3,200 to 51,200, the MRSE-to-plaintext-search ratio fluctuates around 0.75, while the MRSE-HCI-to-plaintext-search ratio increases from 0.65 to 0.75 with the growth of the document-set size. From Fig. 12b we see that the relevance between the query and the retrieved documents in MRSE-HCI is slightly lower than in MRSE. This gap narrows as the data size increases, since a large document set has a clearer category distribution, which improves the relevance between queries and documents. Fig. 12c shows the rank accuracy according to Equation (19). The tradeoff parameter a is set to 1, which means there is no bias towards either the relevance among documents or the relevance between documents and the query. From the results, we conclude that MRSE-HCI outperforms MRSE in rank accuracy.

Fig. 12. Search precision.

Fig. 13 shows the rank privacy according to Equation (20). In this test, regardless of the number of retrieved documents, MRSE-HCI achieves better rank privacy than MRSE. This is mainly because the relevance among documents is introduced into the search strategy.

Fig. 13. Rank privacy.

9 CONCLUSION
In this paper, we investigated ciphertext search in the cloud storage scenario. We explored the problem of maintaining the semantic relationship between plain documents over their encrypted counterparts and presented a design method to enhance the performance of semantic search. We also proposed the MRSE-HCI architecture to meet the requirements of data explosion, online information retrieval, and semantic search. At the same time, a verifiable mechanism was proposed to guarantee the correctness and completeness of search results. In addition, we analyzed the search efficiency and security under two popular threat models. An experimental platform was built to evaluate search efficiency, accuracy, and rank privacy. The experimental results show that the proposed architecture not only properly solves the multi-keyword ranked search problem, but also brings improvements in search efficiency, rank privacy, and the relevance among retrieved documents.

ACKNOWLEDGMENTS
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06040601) and the Xinjiang Uygur Autonomous Region science and technology plan (No. 201230121). An early version of this paper was presented at the Workshop on Security and Privacy in Big Data at IEEE INFOCOM 2014 [27]. Extensive enhancements have been made, including a novel verification scheme that helps data users verify the authenticity of the search results, a security analysis, and more details of the proposed scheme. This work was also supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010701) and the National High Technology Research and Development Program of China (No. 2013AA01A24).

REFERENCES
[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, "Security analysis of authentication protocols for next-generation mobile and CE cloud services," in Proc. IEEE Int. Conf. Consumer Electron., Berlin, Germany, 2011, pp. 83–87.
[2] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proc. IEEE Symp. Security Privacy, Berkeley, CA, 2000, pp. 44–55.
[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in Proc. EUROCRYPT, Interlaken, Switzerland, 2004, pp. 506–522.
[4] Y. C. Chang and M. Mitzenmacher, "Privacy preserving keyword searches on remote encrypted data," in Proc. 3rd Int. Conf. Applied Cryptography Netw. Security, New York, NY, 2005, pp. 442–455.
[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: Improved definitions and efficient constructions," in Proc. 13th ACM Conf. Comput. Commun. Security, Alexandria, VA, 2006, pp. 79–88.
[6] M. Bellare, A. Boldyreva, and A. O'Neill, "Deterministic and efficiently searchable encryption," in Proc. 27th Annu. Int. Cryptol. Conf., Santa Barbara, CA, 2007, pp. 535–552.
[7] D. Boneh and B. Waters, "Conjunctive, subset, and range queries on encrypted data," in Proc. 4th Conf. Theory Cryptography, Amsterdam, The Netherlands, 2007, pp. 535–554.
[8] E.-J. Goh, "Secure indexes," IACR Cryptology ePrint Archive, vol. 2003, p. 216, 2003.
[9] C. Wang, N. Cao, K. Ren, and W. J. Lou, "Enabling secure and efficient ranked keyword search over outsourced cloud data," IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug. 2012.
[10] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He, M. Wu, and D. Oard, "Confidentiality-preserving rank-ordered search," in Proc. ACM Workshop Storage Security Survivability, Alexandria, VA, 2007, pp. 7–12.
[11] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, "Zerber+R: Top-k retrieval from a confidential index," in Proc. 12th Int. Conf. Extending Database Technol., Saint Petersburg, Russia, 2009, pp. 439–449.
[12] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, "Secure ranked keyword search over encrypted cloud data," in Proc. IEEE 30th Int. Conf. Distrib. Comput. Syst., Genova, Italy, 2010, pp. 253–262.
[13] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in Proc. 2nd Int. Conf. Appl. Cryptography Netw. Security, Yellow Mountain, China, 2004, pp. 31–45.
[14] L. Ballard, S. Kamara, and F. Monrose, "Achieving efficient conjunctive keyword searches over encrypted data," in Proc. 7th Int. Conf. Inform. Commun. Security, Beijing, China, 2005, pp. 414–426.
[15] R. Brinkman, "Searching in encrypted data," PhD thesis, University of Twente, 2007.
[16] Y. H. Hwang and P. J. Lee, "Public key encryption with conjunctive keyword search and its extension to a multi-user system," in Proc. 1st Int. Conf. Pairing-Based Cryptography, Tokyo, Japan, 2007, pp. 2–22.
[17] H. Pang, J. Shen, and R. Krishnan, "Privacy-preserving similarity-based text retrieval," ACM Trans. Internet Technol., vol. 10, no. 1, p. 39, Feb. 2010.
[18] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, "Privacy-preserving multi-keyword ranked search over encrypted cloud data," in Proc. IEEE INFOCOM, Shanghai, China, 2011, pp. 829–837.
[19] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, "Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," in Proc. 8th ACM SIGSAC Symp. Inform., Comput. Commun. Security, Hangzhou, China, 2013, pp. 71–82.
[20] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic authenticated index structures for outsourced databases," in Proc. ACM SIGMOD, Chicago, IL, 2006, pp. 121–132.
[21] H. H. Pang and K. L. Tan, "Authenticating query results in edge computing," in Proc. 20th Int. Conf. Data Eng., Boston, MA, 2004, pp. 560–571.
[22] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. G. Stubblebine, "A general model for authenticated data structures," Algorithmica, vol. 39, no. 1, pp. 21–41, May 2004.
[23] R. C. Merkle, "Protocols for public key cryptosystems," in Proc. IEEE Symp. Security Privacy, Oakland, CA, 1980, pp. 122–134.
[24] R. C. Merkle, "A certified digital signature," in Proc. Adv. Cryptol., 1990, vol. 435, pp. 218–238.
[25] M. Naor and K. Nissim, "Certificate revocation and certificate update," IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561–570, Apr. 2000.
[26] H. Pang and K. Mouratidis, "Authenticating the query results of text search engines," Proc. VLDB Endow., vol. 1, no. 1, pp. 126–137, Aug. 2008.
[27] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, "A hierarchical clustering method for big data oriented ciphertext search," in Proc. IEEE INFOCOM Workshop on Security and Privacy in Big Data, Toronto, Canada, 2014, pp. 559–564.
[28] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," in Proc. IEEE INFOCOM, San Diego, CA, 2010, pp. 1–9.
[29] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann, 1999.
[30] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Probability, California, 1967, p. 14.
[31] Z. X. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Min. Knowl. Discov., vol. 2, no. 3, pp. 283–304, Sep. 1998.
[32] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, "Secure kNN computation on encrypted databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, Providence, RI, 2009, pp. 139–152.
[33] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu, "Efficient multi-keyword ranked query over encrypted data in cloud computing," Future Gener. Comput. Syst., vol. 30, pp. 179–190, Jan. 2014.
[34] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proc. 41st Annu. ACM Symp. Theory Comput., 2009, pp. 169–178.
[35] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in Proc. Adv. Cryptol., Berlin, Heidelberg, 2004, pp. 506–522.
[36] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner, "Highly-scalable searchable symmetric encryption with support for Boolean queries," in Proc. Adv. Cryptol., Berlin, Heidelberg, 2013, pp. 353–373.
[37] S. Kamara, C. Papamanthou, and T. Roeder, "Dynamic searchable symmetric encryption," in Proc. ACM Conf. Comput. Commun. Security, 2012, pp. 965–976.
[38] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: Improved definitions and efficient constructions," in Proc. 13th ACM Conf. Comput. Commun. Security, 2006, pp. 79–88.
[39] M. Chase and S. Kamara, "Structured encryption and controlled disclosure," in Proc. Adv. Cryptol., 2010, pp. 577–594.
[40] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M. C. Rosu, and M. Steiner, "Dynamic searchable encryption in very large databases: Data structures and implementation," in Proc. Netw. Distrib. Syst. Security Symp., 2014, doi: 10.14722/ndss.2014.23264.
[41] S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner, "Outsourced symmetric private information retrieval," in Proc. ACM SIGSAC Conf. Comput. Commun. Security, Nov. 2013, pp. 875–888.

Chi Chen received the BS and MS degrees from Shandong University, Jinan, China, in 2000 and 2003, respectively, and the PhD degree from the Institute of Software, Chinese Academy of Sciences, Beijing, China, in 2008. He is an associate research fellow at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include cloud security and database security. From 2003 to 2011, he was a research apprentice, research assistant, and associate research fellow with the State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences. Since 2012, he has been an associate research fellow with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is a member of the IEEE.

Xiaojie Zhu received the BS degree from the Zhejiang University of Technology, Hangzhou, China, in 2011. He is currently working towards the MS degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security. He is a student member of the IEEE.
Peisong Shen received the BS degree from the University of Science and Technology of China, Hefei, China, in 2012. He is currently working towards the PhD degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security. He is a student member of the IEEE.

Jiankun Hu is a professor and research director of the Cyber Security Lab, School of Engineering and IT, The University of New South Wales, Canberra, Australia. His main research interest is cyber security, with a focus on bio-cryptography and anomaly intrusion detection. He has obtained seven ARC (Australian Research Council) grants and has served on the prestigious panel of mathematics, information and computing sciences of the ARC ERA Evaluation Committee. He is a member of the IEEE.
Song Guo received the PhD degree in computer science from the University of Ottawa, Canada. He is currently a full professor at the School of Computer Science and Engineering, the University of Aizu, Japan. His research interests are mainly in the areas of wireless communication and mobile computing, cyber-physical systems, data center networks, cloud computing and networking, big data, and green computing. He has published over 250 papers in refereed journals and conferences in these areas and received three IEEE/ACM best paper awards. Dr. Guo currently serves as secretary of the IEEE ComSoc Technical Committee on Satellite and Space Communication (TCSSC) and the Technical Subcommittee on Big Data (TSCBD), as an associate editor of the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the IEEE Transactions on Emerging Topics in Computing (TETC) for the Computational Networks track, and on the editorial boards of many other journals. He has also served on the organizing and technical committees of numerous international conferences and workshops. Dr. Guo is a senior member of the IEEE and the ACM.

Zahir Tari received the degree in mathematics from the University of Science and Technology Houari Boumediene, Bab-Ezzouar, Algeria, in 1984, the master's degree in operational research from the University of Grenoble, Grenoble, France, in 1985, and the PhD degree in computer science from the University of Grenoble in 1989. He is a professor in distributed systems at RMIT University, Melbourne, Australia. He joined the Database Laboratory at EPFL (Swiss Federal Institute of Technology, 1990-1992) and then moved to QUT (Queensland University of Technology, 1993-1995) and RMIT (Royal Melbourne Institute of Technology, since 1996). He is the head of DSN (Distributed Systems and Networking) at the School of Computer Science and IT, where he pursues high-impact research and development in computer science. He leads several research groups that focus on core areas including networking (QoS routing, TCP/IP congestion), distributed systems (performance, security, mobility, reliability), and distributed applications (SCADA, Web/Internet applications, mobile applications). His recent research interests are in performance (in cloud) and security (in SCADA systems). He regularly publishes in prestigious journals (such as IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Web Services, and ACM Transactions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.). He co-authored two books (John Wiley) and edited more than 10 books. He has been the program committee chair of several international conferences, including DOA (Distributed Objects and Applications Symposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on Data Semantics, and has been the general chair of more than 12 conferences. He is the recipient of 14 ARC (Australian Research Council) grants. He is a senior member of the IEEE.

Albert Y. Zomaya is currently the chair professor of high-performance computing and networking and an Australian Research Council professorial fellow in the School of Information Technologies, The University of Sydney, Sydney, Australia. He is also the director of the Centre for Distributed and High-Performance Computing, which was established in late 2009. He is the author/co-author of seven books and more than 370 papers, and the editor of nine books and 11 conference proceedings. He is the editor-in-chief of the IEEE Transactions on Computers and serves as an associate editor for 19 leading journals.
He is the recipient of the Meritorious Service Award in 2000 and the Golden Core Recognition in 2006, both from the IEEE Computer Society. He is a chartered engineer (CEng), a fellow of the AAAS, the IEEE, and the IET (UK), and a distinguished engineer of the ACM.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.