This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2016.2593444, IEEE Transactions on Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing ( Volume: 15 , Issue: 3 , May-June 1 2018 )
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

Searchable Encryption over Feature-Rich Data


Qian Wang, Member, IEEE, Meiqi He, Minxin Du,
Sherman S. M. Chow, Member, IEEE, Russell W. F. Lai, and Qin Zou, Member, IEEE

Abstract—Storage services allow data owners to store their huge amount of potentially sensitive data, such as audios, images, and
videos, on remote cloud servers in encrypted form. To enable retrieval of encrypted files of interest, many searchable symmetric
encryption (SSE) schemes have been proposed. However, most existing SSE solutions construct indexes based on keyword-file pairs
and focus on boolean expressions of exact keyword matches. Moreover, most dynamic SSE solutions cannot achieve forward privacy
and reveal unnecessary information when updating the encrypted databases.
We tackle the challenge of supporting large-scale similarity search over encrypted feature-rich multimedia data, by considering the
search criteria as a high-dimensional feature vector instead of a keyword. Our solutions are built on carefully-designed fuzzy Bloom
filters which utilize locality sensitive hashing (LSH) to encode an index associating the file identifiers and feature vectors. Our schemes
are proven to be secure against adaptively chosen query attack and forward private in the standard model. We have evaluated the
performance of our scheme on various real-world high-dimensional datasets, and achieved a search quality of 99% recall with only a
small number of hash tables for LSH. This shows that our index is compact and searching is not only efficient but also accurate.

Index Terms—Cloud storage, searchable encryption, homomorphic encryption, similarity search, proximity search

• Q. Wang, M. Du, and Q. Zou are with The State Key Lab of Software Engineering, School of Computer Science, Wuhan University, China. E-mail: {qianwang,duminxin,qzou}@whu.edu.cn
• M. He is with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: mqhe@hku.hk
• S. Chow and R. Lai are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: {sherman,wflai}@ie.cuhk.edu.hk

1 INTRODUCTION
Storage services have motivated data users, including both individuals and enterprises, to outsource their (potentially huge amount of) data to remote servers (e.g., cloud) to save the expensive local storage and management costs. However, outsourced data is no longer under the direct physical control of the data owner. Sensitive data, such as personal files, commercial secrets, and healthcare records, should be encrypted locally before outsourcing. Data encryption, however, renders data utilization a challenging task. In particular, it becomes difficult to retrieve data of interest based on their content as in the plaintext search.
To enable users to efficiently retrieve encrypted data of interest from a remote storage server, the notion of searchable symmetric encryption (SSE) was proposed [1] and design of index-based SSE schemes has received much attention recently. The secret key in SSE can generate search tokens to perform search queries over encrypted data. Different formal security notions have been proposed (e.g., see [2]). By sacrificing access pattern and/or search pattern, most SSE schemes achieve highly efficient searches while maintaining privacy guarantee from a practical perspective.
Most existing SSE solutions [2], [3], [4], [5] constructed keyword-based search indexes (e.g., linked lists based on keyword-file pairs), supporting only efficient exact but not similarity searches. Moreover, most dynamic schemes cannot guarantee forward privacy, which requires that the database server cannot learn whether a newly added data file contains a keyword w that has been searched in the past.

1.1 Related Work
The first full-text SSE scheme was proposed by Song et al. [1] with search time linear in the length of the file collection. Curtmola et al. [3] generalized security definitions of SSE and proposed a construction based on the inverted index. The search time is linear in the number of files that contain keyword w which is considered as optimal. Kamara et al. [2] extended it to dynamic data which supports both addition and removal of files. Kamara et al. [6] proposed a parallelizable and dynamic SSE scheme using red-black trees.
The above dynamic SSE schemes cannot achieve forward privacy, which means the search token allows linking the files to be added in the future. The scheme of Stefanov et al. [7] is forward private. Although the theoretical cost is sub-linear, the actual search time cost will be large when the number of document-keyword pairs is large.
Solutions [4], [8], [9], [10] supporting boolean expressions on keywords have also been proposed. Performing conjunctive queries over encrypted data was first considered by Golle et al. [8]. Their approach works by testing each encoded file against a set of tokens, so the complexity grows linearly with the number of files. This is also a general problem of public-key scheme [9]. For supporting generic boolean search over the encrypted data, Moataz and Shikfa [10] proposed a new SSE scheme which is also based on vectors (but not for similarity search). By transferring keywords to vectors and using the Gram-Schmidt process to orthogonalize and orthonormalize them, their scheme leverages the inner product of vectors to execute the encrypted search. The work of Cash et al. [4] first retrieves the results for one of the query keywords, and then filters the results according to the given boolean query against each of remaining keywords. The above solutions only support keyword-based exact searches and do not allow similarity search over encrypted feature-rich data.
As a generalization of SSE, structured encryption is

1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
proposed by Chase and Kamara [11]. The data structures supported are matrices and graphs. This is further generalized by Lai and Chow [12] which supports any abstract data type in the form of query and response pair. However, utilizing their scheme in this setting means one needs to pre-compute all possible queries, which does not satisfy our goal to support unforeseen queries.

1.2 Similarity Search
Similarity search has been widely used in the information retrieval community for content-based search of feature-rich data (e.g., image, audio, and video files), and other queries such as finding close/similar points in certain metric spaces [13]. The feature-rich data objects are usually represented as high-dimensional vectors in a metric space. Distances between objects are measured by Euclidean distance or other distance metrics. In contrast to exact search, similarity search can relax constraints on users' requests to retrieve a list of desirable results based on likely relevance. Similarity search also tolerates uncertain and inaccurate search inputs, and thus improves usability and search experience in general.
With the increasing popularity of sharing multimedia data in social networks, the demand for similarity search over encrypted multimedia data is also increasing. For example, users of social networks may want to identify and view multimedia files shared by their friends who enforced access control. Yet, similarity search has received attention mostly in the plaintext domain [14], [15].
Below we discuss three recent approaches for establishment of similarity indexes while providing privacy-preserving content-based searches. Earlier approaches can be found in the references within.
Kuzu et al. [16] investigated the use of locality sensitive hashing (LSH) family for supporting similarity search based on Jaccard distance, and used computationally-expensive methods to encrypt the index. The main problem of their scheme stems from the embedding technique for transforming the metric space on edit distance to the Euclidean space. It is proven that existing embedding approaches cannot provide sufficient distance preservation after space transformation, and lead to intolerable search errors [17]. Another limitation is the large number of hash tables required to cover the nearest objects. When this space requirement exceeds the main memory size, looking up the hash tables may cause substantial delay due to disk I/O access [18]. Moreover, it lacks a rigorous performance analysis, i.e., there is no theoretical characterization of the search quality.
Yuan and Yu [19] proposed a privacy-preserving biometric identification for large-scale encrypted databases. The search complexity, however, is approximately $O(mn^2)$ for the exact search, where m denotes the dataset size and n is the vector dimension. The security of their scheme relies on the heuristic argument that security comes from the "encryption" of feature vectors by randomly-generated matrices. It is possible that the introduced randomness can be removed in the presence of collusion. In addition, Moataz et al. [20] made the first attempt at performing similarity search over text keywords by combining the searchable encryption technique with a stemming algorithm to achieve fuzzy searches over encrypted data. Applying their technique to our setting means that one needs to tag the data with many possible keywords, which may not be practical.
Based on similar approaches, Wang et al. [21] proposed a multi-keyword fuzzy search scheme over encrypted data by using LSH and secure k nearest neighbors (kNN) [22]. Still, this scheme lacks formal and thorough analysis under a well-defined and rigorous security model as previously pointed out [22]. In contrast, most existing SSE schemes from the cryptography community have their security rigorously analyzed [1], [2].

1.3 Our Contributions
This paper tackles the problem of constructing a secure similarity index for efficient approximate members search over large-scale encrypted feature-rich data. The key idea is to transform the data into a set of feature vectors, which are further mapped by LSHs to an array position. We call the entry there a bucket, which will be pointing to the possible matching files. Since the LSH sends all similar files to the same output, together with our special treatment of propagation (to be explained), it hence creates a cluster of buckets which contains files approximately close to the query object with a high probability.
To represent the pointers to the file, we store the inverted file identifier vectors (IFVs) in each of these buckets. An IFV is a vector which represents the files that fall in a given bucket for one hash function (among the set of hash functions in the LSH). For a trade-off of computation efficiency and bandwidth, we have two options in encrypting the resulting IFVs, which are additive homomorphic encryption (for saving bandwidth) or pseudo-random functions/permutations (for more efficient computation).
The resulting schemes enable the client to perform privacy-preserving similarity search by checking only a small number of hash tables (or buckets). The response set consists of approximate files which can be readily used for privacy-preserving kNN or approximate nearest neighbors (ANN) search. Another important and desirable property of our index constructions is that they are highly compact, neat and efficient, and they can also efficiently support secure dynamic updates of file and index with minimal operations.
Our main contributions are summarized as follows.
1) We study the problem of similarity search over encrypted feature-rich data outsourced to remote servers. We characterize the unique security requirements and formally define the security notions for encrypted similarity search. We then present the design of our encrypted indexing schemes with two instantiations supporting privacy-preserving high-dimensional similarity search and data dynamics.
2) We present a thorough theoretical analysis and characterize the false positive and false negative rates of our constructions. We prove the semantic security of our schemes under the adaptively chosen query attack (CQA2) model with minimal leakage using simulation-based definitions and proofs. We also show that they achieve forward privacy.
3) We evaluate our constructions against three real-world datasets, with different sizes and qualities. The results show that they are both time- and space-efficient, and that they provide high search quality in terms of recall and error ratio.

2 PROBLEM STATEMENT AND PRELIMINARIES
2.1 Similarity Search over Encrypted Feature-rich Data
We consider a data owner, also known as the client, with some large-scale and sensitive multimedia data to be outsourced to a cloud-based storage system. The client encrypts the data locally before outsourcing. The collection of multimedia data is denoted by $f = (f_1, \ldots, f_{\#f})$, and its corresponding encrypted form is denoted by $c = (c_1, \ldots, c_{\#c})$.
To enable similarity search over c for effective information retrieval, the data owner builds an encrypted index I based on f and encrypts f to obtain c, then outsources the encrypted data γ = (I, c) to the remote cloud server. Later, the client issues a query in the form of a search object q. This object q will be represented by a feature vector throughout the rest of the paper, which is a multi-dimensional vector of numerical features that represents some object (e.g., some selected key scene of a video file, color histograms for image data [23], etc.). The query may merely be near some of the files without exactly matching any of them. To this end, the client generates a search token $\tau_q$ for q and sends it to the cloud server. After receiving $\tau_q$, the cloud uses it to query I and returns all the encrypted file IDs whose corresponding files $c_q$ have characteristics similar to the search object q.
The system is dynamic, which means that the client may add or remove multiple files $u = f'$ to or from the servers at any time. To do so, the client generates an update token $\tau_u$. The server can then use $\tau_u$ to securely update c and I.
Formally, the core functionalities are defined below.

Definition 1. An encrypted (multimedia) data storage system for supporting similarity search and dynamic update consists of the following seven polynomial-time algorithms/protocols:

$sk \leftarrow \mathsf{Gen}(1^\lambda)$: is a probabilistic key generation algorithm run by the client. It takes as input a security parameter λ and outputs the secret key sk.

$\gamma \leftarrow \mathsf{Enc}(sk, f)$: is a probabilistic algorithm run by the client. It takes as input a secret key sk and a (multimedia) data collection f, and outputs an encrypted database γ.
An encrypted database is defined by the tuple (c, I), where c is a sequence of ciphertexts, each encrypting the corresponding plaintext in f, and I is an encrypted index, which supports querying the ciphertexts in c via the Search protocol. Conceptually, this algorithm consists of two disjoint operations: the index generation and the encryption of the plaintext.

$f \leftarrow \mathsf{Dec}(sk, c)$: is a deterministic algorithm run by the client. It takes as input a secret key sk and a ciphertext c, and outputs a file f.

$(c_q; \bot) \leftarrow \mathsf{Search}(sk, q; \gamma)$: is a (possibly interactive and probabilistic) protocol run between the client and the server [footnote 1]. The client side takes as input a secret key sk and a search object q, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains a sequence of ciphertexts $c_q$ while the server has no local output.

$(\bot; \gamma') \leftarrow \mathsf{Update}(sk, u; \gamma)$: is a (possibly interactive and probabilistic) protocol run between the client and the server. The client side takes as input a secret key sk and an update object u, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains nothing, while the server obtains an updated encrypted database $\gamma'$.

Footnote 1. A protocol P run between the client and the server is denoted by $(u; v) \leftarrow P(x; y)$, where x and y are the respective inputs, and u and v are the respective outputs, of the client and the server.

2.2 Security Definitions
Similarity search over encrypted feature-rich data should provide the following security guarantees. First, an adversary cannot generate search tokens for feature vectors without the secret key. Second, the search tokens generated for the similarity indexes do not reveal any information of the original search objects (i.e., feature vectors of files of interest) beyond what is implied by the search result. Third, before and after similarity searches, the server learns only what is allowed to be leaked by the data owner, i.e., the search pattern and access pattern.
We follow the widely-accepted framework to analyze the security of SSE [2], [3]. The security of an SSE scheme can be characterized via the notion of history, search pattern, and access pattern [3]. Below we will give their definitions with respect to the SSE schemes we are going to propose.

Definition 2. (History Q). An s-query history over f is a vector of s queries $Q = \{q_1, \ldots, q_s\}$.

Definition 3. (Search Pattern π). Given a search object q at time t, the search pattern π(f, q) is defined by a binary vector of length t with a '1' at location i if the search at time i ≤ t was for q; '0' otherwise.
The search pattern reveals whether the same search was performed in the past or not.

Definition 4. (Access Pattern $A_p$). Given a search object q at time t, the access pattern is defined by $A_p(f, q) = \{id(c_q)\}$.

Now we come to the main definition asserting the security of an SSE scheme. Our definition follows the simulation-based framework [3], parameterized by the leakage required for the ideal world simulation [11].

Definition 5. (CQA2-security for SSE). Let $L_1$, $L_2$, and $L_3$ be leakage functions for setup, search, and update respectively, which serve as the parameters for the definition [11]. A searchable encryption scheme SSE is said to be secure against adaptively chosen query attack (CQA2), if for any PPT stateful adversary A, there exists a PPT stateful simulator S such that for the following probabilistic experiments $\mathbf{Real}^{SSE}_{A}(\lambda)$ and $\mathbf{Ideal}^{SSE}_{A,S}(\lambda)$, we have
$$\left|\Pr[\mathbf{Real}^{SSE}_{A}(\lambda) = 1] - \Pr[\mathbf{Ideal}^{SSE}_{A,S}(\lambda) = 1]\right| \le \mathrm{negl}(\lambda).$$

$\mathbf{Real}^{SSE}_{A}(\lambda)$: The challenger runs $\mathsf{Gen}(1^\lambda)$ to generate the secret key sk. A outputs f and receives $\gamma \leftarrow \mathsf{Enc}(sk, f)$ from the challenger. A makes a polynomial number of adaptive queries q or updates u. For each query q, the challenger acts as a client and runs Search with A acting as a server. For each update u, the challenger acts as a client and runs Update with A acting as a server. Finally, A returns a bit b, which is the output of the experiment.

$\mathbf{Ideal}^{SSE}_{A,S}(\lambda)$: A outputs f and sends it to S. Given $L_1(f)$, S generates and sends γ to A. A makes a polynomial number

of adaptive queries q or updates u. For each query q , S is we need to enlarge the gap between the two probabilities, i.e.,
given L2 (f , q), and simulates a client who runs Search with A making P1  P2 . In practice, R can be pre-set by users in
acting as an honest server. For each update u, S is given L3 (f , u), the construction of similarity indexes to meet the requirements
and simulates a client who runs Update with A acting as an of different applications. In Section 5, we carefully choose R based
honest server. Finally, A returns a bit b, which is the output of on the characteristics of each dataset.
the experiment.
In this paper, we choose l2 norm, i.e., the Euclidean norm
||x||2 = ( d 2 1/2
P
The bit b output by both the real and ideal experiments, i=1 xi ) , for LSH using 2-stable distribution
which is originated from the adversary A, can be seen as in our construction. Specifically, each hash function is de-
the decision of A in distinguishing whether it is in the fined by
real world or in the ideal world. The intuition is that, a·q+b
ha,b = b c,
since only the leakages are enough to simulate the ideal W
world, any scheme satisfying this definition will only let any where a is a d-dimensional vector with entries chosen inde-
computationally-bounded adversary to learn the leakage pendently from a 2-stable distribution: Normal distribution
2
but not the other information in the real world. g(x) = √12π e−x /2 , and b is chosen uniformly from [0, W ].
The above definition ensures adaptive security when the Note that, W should not be too large, otherwise P1 and P2
query objects are chosen as a function of the encrypted index will both go to 1. In practice, W should be chosen (not too
and the results of previous queries. The leakage functions small) to enlarge the gap between P1 and P2 given a fixed c.
L1 , L2 , and L3 will be defined for specific schemes. More details can be found in the literature [24].
We proceed to define forward privacy. Although some We remark that the LSH family aims at mapping similar
existing schemes (e.g., [7]) is forward private, to the best objects into the same or the neighboring buckets. It is not
of our knowledge, a formal definition of the notion is never meant to be a cryptographic one which aims at providing
explicitly given. We therefore give our own definition below. collision resistance for instance.
Informally, forward privacy means that, for any search
object q , the knowledge of the server learned from previous Pseudo-random function (PRF) and permutation (PRP).
execution of the search protocol on q does not help the Let G : {0, 1}λ × {0, 1}poly(λ) → {0, 1}poly(λ) be a PRF,
server decide whether a newly added ciphertext c relates to which is a polynomial-time computable function that cannot
q or not. Our definition requires that any two “add” updates be distinguished from random function by any polynomial-
of the same size are indistinguishable. time adversary [25], and F be a PRP, which is a PRF but the
function is also bijective.
Definition 6. (Forward Privacy). A searchable encryption Looking ahead, in our construction, for a given key k ,
scheme SSE is said to be forward private, if for any PPT stateful the range of Fk (·) will be as large as the encrypted index;
adversary A, we have and Gk (·) takes as input a bit-string which can be as large
1 as a double of the output length of Fk (·), and outputs a bit
Pr[FwdPrvSSE
A (λ) = 1] ≤ + negl(λ).
2 string of length #f .

Homomorphic encryption. Homomorphic encryption al-


FwdPrvSSE A (λ): The challenger runs Gen(1λ ) to generate the lows certain computations to be carried out on cipher-
key sk . A outputs f and receives γ ← Enc(sk, f ) from the texts to generate an encrypted result which matches the
challenger. A makes a polynomial number of adaptive queries q result of operations performed on the plaintext after being
or updates u. For query q , the challenger acts as a client and runs decrypted. Although homomorphic encryption for general
Search with A acting as a server. For update u, the challenger operations is prohibitively slow, homomorphic encryption
acts as a client and runs Update with A acting as a server. for only summation operations is efficient, such as Paillier
Eventually, A outputs two “add” updates ui = (“Add00 , fui ) cryptosystem [26] which is additively homomorphic. It is
such that |f0 | = |f1 | for any fi ∈ fui , i = 0, 1. The challenger secure against chosen plaintext attack (CPA) assuming the
picks a random bit b and runs Update with update object ub . decisional composite residuosity problem is hard [26].
Finally, A returns a bit b0 and the experiment returns |b − b0 |. In Paillier cryptosystem, the public (encryption) key is
pkp = (n = pq, g), where g ∈ Z∗n2 , and p and q are
two large prime numbers (of equivalent length) chosen
2.3 Preliminaries
randomly and independently. The private (decryption) key
Our constructions use the following basic primitives. is skp = (ϕ(n), ϕ(n)−1 mod n).
Given a message a, we write the encryption of a as
Locality sensitive hash (LSH).
JaKpkp , or simply JaK. The encryption of a message x ∈ Zn
Definition 7. [24] Locality sensitive hash (LSH) family is JxK = g x · rn mod n2 , for some random r ∈ Z∗n .
H = {hj : {0, 1}d → {0, 1}t |j = 1, . . . , M } is called The decryption of the ciphertext is x = L(JxKϕ(n) mod
(R, cR, P1 , P2 )-sensitive for a distance function || ∗ || if for any n2 ) · ϕ−1 (n) mod n, where L(u) = u−1 n . The homomorphic
p, q ∈ {0, 1}d and for any j ∈ 1, . . . , M , property of the Paillier cryptosystem is given by Jx1 K·Jx2 K =
(g x1 · r1n ) · (g x2 · r2n ) = g x1 +x2 (r1 r2 )n mod n2 = Jx1 + x2 K.
• if ||q − p|| ≤ R then Pr[hj (q) = hj (p)] ≥ P1 ;
• if ||q − p|| ≥ cR then Pr[hj (q) = hj (p)] ≤ P2 . 2.4 Notations
To use LSH for similarity search, we should at least ensure Standard bitwise Boolean operations
W are defined on binary
that c > 1 and P1 > P2 . To reduce the false positive rate, vectors: bitwise OR (union) “ ” and bitwise AND (intersec-

1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2016.2593444, IEEE
IEEE Transactions on Dependable and Secure Computing Transactions on Dependable 15
( Volume: and ,Secure Computing
Issue: 3 , May-June 1 2018 )
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 5
V
tion) “ ”, which take the same notations as single bit OR 3.2.1 Initialization and Index Construction
and AND operations, respectively. λ
Gen(1 ): Given a security parameter λ, choose the following
Feature-rich data objects are typically represented as uniformly at random from the corresponding domains:
high-dimensional vectors of numerical features. We use pi to
$
denote the d-dimensional feature vector extracted from data • a PRP key k1 ← {0, 1}λ for Fk1 (·),
file i. When representing images, the feature values might • a symmetric key k2 ← SKE.Gen(1λ ) for a CPA-
correspond to the pixels of an image, when representing secure symmetric encryption SKE,
texts perhaps term occurrence frequencies. • functions (h1 , h2 , . . . , hM ) from the LSH family H.
[n] denotes the set of positive integers less than or equal
to n, i.e., [n] = {1, 2, . . . , n}. The M hash functions will be used to generate M hash
tables, each consists of m buckets. The output of this al-
gorithm sk = (k1 , k2 , m, h1 , h2 , . . . , hM ).
2.5 Inverted File Vectors
Enc(sk, f ):
Given a vector v, we refer its ith element as v[i] or vi . We
use ei to denote a unit basis vector which is a bit-vector of 1) [Initialize a temporary array of buckets]
length #f having the ith entry being 1; 0 otherwise. Let x1 , . . . , xM m be the positions of all buckets. For
We use fidj ∈ {0, 1}#f to denote the inverted file identifier each xj , store 0 (a bit-vector of length #f having
vector (IFV) placed at the j th bucket. In fidj , the ith entry all 0 bits) in I[xj ].
of IFV, i.e., fidj [i], is 1 if ei has been mapped into the j th 2) [Prepare the feature-to-file index according to the
bucket; 0 elsewhere . We use fid c to denote a joint IFV com- buckets given by LSH]
puted from the “union” of multiple IFVs, to be explained in For each file fi ∈ f ,
the next section. IFV is a cornerstone of our constructions. (a) Obtain its corresponding d-dimensional feature
vector pi using the feature extraction algorithm for
the specific data type (e.g., color histograms for
3 O UR C ONSTRUCTIONS image data [23]).
This section presents our index data structure for similarity (b) Compute the M hash bucket positions of the
search over feature-rich encrypted multimedia data.

3.1 Overview
Our key idea is an encoding approach based on inverted file identifier vectors (IFVs), bit-vectors representing a set of files, which allow identification of files with features most similar to the query feature vector. We map high-dimensional points onto a line by using an LSH as a basis to put similar objects into the same or adjacent buckets. We aim at minimizing the space overhead. At the same time, we also aim at maintaining an acceptable level of collision probability of nearby objects and increasing the chance of finding similar data objects close to the query. To do so, we generate a joint IFV $\widehat{\mathrm{fid}}$ by calculating the “union” of each bucket and its two neighboring buckets, so that probing $\widehat{\mathrm{fid}}$ is equivalent to probing three adjacent buckets in one shot. For security, the buckets are permuted through a pseudorandom permutation, as will be shown in our construction.

We propose two schemes based on the above ideas by encrypting $\widehat{\mathrm{fid}}$ either with the Paillier homomorphic cryptosystem (with ciphertext packing) or by XOR-ing bit vectors with the outputs of a PRF. Intuitively, Scheme-I has lower communication cost, since the search result consists of only one Paillier encryption aggregating $M$ encrypted IFVs. Scheme-II is more computationally efficient due to the use of symmetric encryption, but it requires the server to return $M$ encrypted IFVs. More details can be found in the experimental results in Table 5.

LSH $(x_1 = g_1(p_i), \ldots, x_M = g_M(p_i))$, where

$$g_j(p_i) = h_j(p_i) \bmod m + m(j-1) \quad \text{for } j \in [M].$$

(c) For each bucket position $x_j$, update the bucket by storing the bitwise OR of the unit basis vector $e_i$ and the IFV previously stored in the bucket, denoted by $I[x_j] \vee e_i$.
3) [Propagate the IFVs to the neighbor]
For each bucket $x_j$, let $\mathrm{fid}_j$ be the IFV stored in $I[x_j]$. Compute the “union” $(\mathrm{fid}_j^- \vee \mathrm{fid}_j \vee \mathrm{fid}_j^+)$ as a joint IFV $\widehat{\mathrm{fid}}_j$, where $\mathrm{fid}_j^-$ and $\mathrm{fid}_j^+$ denote the IFVs stored in the neighboring left and right buckets, respectively. Store $\widehat{\mathrm{fid}}_j$ in $I[x_j]$.
4) [Permute the buckets]
Permute array $I$ according to the mapping of $x_j$ to the randomized position $F_{k_1}(x_j)$ for each position $x_j$.
5) For $1 \le i \le \#\mathbf{f}$, let $c_i \leftarrow \mathrm{SKE.Enc}_{k_2}(f_i)$, where SKE can be any secure encryption scheme.
6) Output $\gamma = (I, \mathbf{c})$, where $\mathbf{c} = (c_1, \ldots, c_{\#\mathbf{f}})$.

Dec$(sk, c)$: Return $f \leftarrow \mathrm{SKE.Dec}_{k_2}(c)$.

3.2.2 Search Over the Similarity Index
The client searches by first generating a search token $\tau_q$ for the search object $q$. The server can then use the token to locate the buckets in the similarity index $I$ and extract the corresponding joint IFVs. Finally, the server decodes the IDs of files that are approximate members to the query $q$, and returns the corresponding ciphertexts to the client.
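The bucket update, neighbor union, bucket permutation, and token-based lookup described above can be illustrated with a short plaintext sketch. It assumes the LSH form $h(p) = \lfloor (a \cdot p + b)/W \rfloor$ used in the search-quality analysis, stores each IFV as a Python integer bitmask, keeps neighbor unions inside one hash table's block of $m$ buckets, and replaces the PRP $F_{k_1}$ with a seeded shuffle. All function names are hypothetical; this is not the authors' implementation and it provides no security.

```python
import math
import random

def make_lsh(dim, M, W, seed=0):
    """Draw M hash functions h_j(p) = floor((a_j . p + b_j) / W)."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, W))
            for _ in range(M)]

def bucket_positions(p, lsh, m, W):
    """g_j(p) = h_j(p) mod m + m*(j-1); j is counted from 0 here."""
    pos = []
    for j, (a, b) in enumerate(lsh):
        h = math.floor((sum(x * y for x, y in zip(a, p)) + b) / W)
        pos.append(h % m + m * j)
    return pos

def build_index(points, lsh, m, W, perm):
    M = len(lsh)
    raw = [0] * (M * m)                       # one IFV (bitmask) per bucket
    for i, p in enumerate(points):
        for x in bucket_positions(p, lsh, m, W):
            raw[x] |= 1 << i                  # OR in the unit vector e_i
    joint = [0] * (M * m)
    for x in range(M * m):
        lo = (x // m) * m                     # first bucket of this table (assumption:
        hi = lo + m - 1                       # neighbors never cross table blocks)
        joint[x] = raw[x]
        if x > lo:
            joint[x] |= raw[x - 1]            # left neighbor
        if x < hi:
            joint[x] |= raw[x + 1]            # right neighbor
    index = [0] * (M * m)
    for x in range(M * m):
        index[perm[x]] = joint[x]             # stand-in for the PRP F_{k1}
    return index

def search(q, lsh, m, W, perm, index, n_files):
    token = [perm[x] for x in bucket_positions(q, lsh, m, W)]
    d = (1 << n_files) - 1
    for y in token:
        d &= index[y]                         # bitwise AND of the M joint IFVs
    return [i for i in range(n_files) if (d >> i) & 1]

# Example: four 2-dimensional feature vectors, M = 3 tables of m = 6 buckets.
points = [[0.0, 0.0], [0.2, 0.1], [40.0, -35.0], [7.5, 3.0]]
lsh = make_lsh(dim=2, M=3, W=4.0, seed=1)
perm = list(range(3 * 6))
random.Random(7).shuffle(perm)
index = build_index(points, lsh, 6, 4.0, perm)
print(search(points[0], lsh, 6, 4.0, perm, index, len(points)))
```

Because every joint IFV contains the bucket's own IFV, querying with an indexed point always returns at least that point's own identifier; whether nearby points are returned depends on the drawn hash functions.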

3.2 Basic Construction without Index Privacy
We first present our basic index construction, which does not protect the privacy of IFVs, as follows.

Search$(sk, q; I)$:
Client Side:
Extract the $d$-dimensional feature vector $q$ based on the interest of the client. Compute $(x_1 = g_1(q), \ldots, x_M =$


$g_M(q))$ and the search token $\tau_q = (F_{k_1}(x_1), \ldots, F_{k_1}(x_M))$.
Send the search token $\tau_q$ to the server.
Server Side:
Let $y_1 = F_{k_1}(x_1), \ldots, y_M = F_{k_1}(x_M)$. Extract the corresponding joint IFVs $\widehat{\mathrm{fid}}_i$ from $I[y_i]$ for each $i \in [M]$. Compute $d = \bigwedge_{i=1}^{M} \widehat{\mathrm{fid}}_i$. Return $c_q = \{c_i \in \mathbf{c} : d[i] = 1\}$.

3.2.3 Supporting Efficient Index Dynamics
Update$(sk, \mathbf{f}'; \gamma)$:
Client Side:
To add new files $\mathbf{f}'$, the client locally encrypts the files as $c_{f'}$ and builds the sub-index $I_{f'}$ for $f_{f'}$ (which has the same structure as the similarity index on the server). To do so, the client simply runs $(I_{f'}, c_{f'}) \leftarrow \mathrm{Enc}(sk, f_{f'})$ and sets the update token $\tau_u = (I_{f'}, c_{f'})$.
To delete existing files $\mathbf{f}'$, first search for their locations, then generate $\tau_u = \{i : f_i \in \mathbf{f}'\}$.
In either case, send the update token $\tau_u$ to the server.
Server Side:
For addition, merge $I_{f'}$ into $I$ and $c_{f'}$ into $\mathbf{c}$, respectively.
For deletion, to delete $f_i \in \mathbf{f}'$, set the $i$th entry of each row of $I$ to 0, and delete $c_{f'}$ from $\mathbf{c}$.

3.2.4 Illustrating Example
Suppose the index is built on six files $f_1, \ldots, f_6$, and three LSH functions $h_1, h_2, h_3$ are randomly chosen, i.e., $M = 3$, $m = 6$. We have $g_1 = h_1$, $g_2 = h_2 + 6$ and $g_3 = h_3 + 12$.
In the index construction, $f_1, \ldots, f_6$ are processed to generate feature vectors $p_i$, and their corresponding initial IFVs (i.e., $e_1, \ldots, e_6$) are then stored into the index array at the buckets mapped by the LSHs. Joint IFVs are then computed for each bucket. Finally, all bucket positions are further permuted randomly using the PRP, i.e., $F_{k_1}[g_1(p_i)], F_{k_1}[g_2(p_i)], F_{k_1}[g_3(p_i)]$ ($i \in \{1, 2, 3, 4, 5, 6\}$). All the joint IFVs form the similarity index $I$ to be stored on the server.
To search, the client can first (process the file of interest to) create a $d$-dimensional vector $q$. Suppose $f_1$, $f_4$ and $f_6$ are some approximate near neighbors of $q$, and $g_1(q) = g_1(p_1) = g_1(p_4) = g_1(p_6) = 3$, $g_3(q) = g_3(p_1) = g_3(p_4) = g_3(p_6) = 15$, but $g_2(q) = g_2(p_4) = g_2(p_6) = 10 \neq g_2(p_1) = 9$. As described, IFVs in the neighboring buckets are also explored by bitwise union operations. With $k_1$, the client prepares the search token $\tau_q = (F_{k_1}[g_1(q)] = 6, F_{k_1}[g_2(q)] = 3, F_{k_1}[g_3(q)] = 18)$, which enables the server to locate the IFVs corresponding to the approximate near neighbors of the query object $q$ in $I$. Suppose further that $F_{k_1}[g_2(p_1)] = 13$. The original bucket positions (i.e., 3, 9, 10, 15) are randomized by the PRP into the new ones (i.e., 6, 13, 3, 18). Then the server extracts the joint IFVs in the hit buckets, i.e., $I[3] = (111101)$, $I[6] = (100101)$ and $I[18] = (100101)$, performs an intersection operation and gets $(100101)$, which means $f_1$, $f_4$ and $f_6$ are indeed the approximate members of $q$.

[Fig. 1: An illustration of the packing technique. The original IFV is expanded into the extended IFV to be encrypted: each bit is padded with 0's into a $\lceil \log_2 3M \rceil$-bit block, and the blocks fill 1024-bit plaintexts.]

although in an encrypted form. This preserves the low communication cost (independent of $M$) during the search.

Saving Space by Packing. Typical parameters for Paillier encryption support large (e.g., 1024-bit) integers as the plaintext space, but we only want to encrypt bit vectors of length $\#\mathbf{f}$. To pack a bit vector into multiple 1024-bit integers while leaving room for addition of individual bits, we use a block size of $\lceil \log_2 3M \rceil$ bits in the plaintext space to represent a bit in the bit vector, as shown in Fig. 1. The remaining bits in the plaintext space which are insufficient to form a block are set to 0. The packed integers are then encrypted by the Paillier encryption.

It is easy to see that one needs to allocate more than $\lceil \log_2 M \rceil$ bits, to ensure that the summation of bit values in one row of IFV across $M$ different buckets will not affect the aggregated values of other rows during the search process (refer to Steps 1 to 3 at the client side in Section 3.3.2). The reason to choose a block size of specifically $\lceil \log_2 3M \rceil$ bits lies in our deletion mechanism. For deletion, we will set the second to last entry to 1 in the 1024-bit 0 vector corresponding to the file $f_i$ to be deleted and then generate its encrypted sub-index $I_{u_i}$ (as presented in Step 2 at the client side in Section 3.3.3). The obtained sub-index $I_{u_i}$ will be used to perform additive homomorphic operations on all the IFVs stored in the buckets (as presented in Step 2 at the server side). That is, upon finishing the delete updates, the last two entries of the block corresponding to $f_i$ in all the buckets are changed to either “11” or “10”. Hence, for any future new search, the decimal value of the entries corresponding to $f_i$ in the aggregated form will be turned to $3M$ or $2M$, which will not incur a “collision” (i.e., it will not affect the correctness of checking equality of the corresponding entries for a new search). This is the reason why we should pad $(\lceil \log_2 3M \rceil - 1)$ bits of 0's before each bit of the IFV to obtain a plaintext block.
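A plaintext sketch of the packing layout may make the block arithmetic easier to follow. It assumes that each IFV bit occupies the least significant position of its $\lceil \log_2 3M \rceil$-bit block, that blocks are filled from the low end of each 1024-bit word, and that ordinary integer addition stands in for Paillier's homomorphic addition. The function names are hypothetical and no encryption is performed.

```python
def block_bits(M):
    """ceil(log2(3*M)), computed with integer arithmetic."""
    return (3 * M - 1).bit_length()

def pack(bits, M, word_bits=1024):
    """Pack an IFV (list of 0/1) into integers of at most word_bits bits."""
    b = block_bits(M)
    per_word = word_bits // b                 # leftover high bits stay 0
    words = [0] * ((len(bits) + per_word - 1) // per_word)
    for i, bit in enumerate(bits):
        words[i // per_word] |= bit << ((i % per_word) * b)
    return words

def add_packed(u, v):
    """Word-wise addition; stands in for the homomorphic aggregation."""
    return [x + y for x, y in zip(u, v)]

def unpack(words, n_files, M, word_bits=1024):
    """Recover the per-file block values from the aggregated words."""
    b = block_bits(M)
    per_word = word_bits // b
    mask = (1 << b) - 1
    return [(words[i // per_word] >> ((i % per_word) * b)) & mask
            for i in range(n_files)]

# The three hit buckets of the illustrating example (M = 3, six files).
M = 3
hit_buckets = ["111101", "100101", "100101"]
total = None
for s in hit_buckets:
    w = pack([int(ch) for ch in s], M)
    total = w if total is None else add_packed(total, w)
counts = unpack(total, 6, M)
matches = [i + 1 for i, v in enumerate(counts) if v == M]   # 1-based file numbers
print(counts, matches)
```

For these three buckets the per-file sums are (3, 1, 1, 3, 0, 3), so exactly $f_1$, $f_4$ and $f_6$ reach the value $M = 3$, matching the intersection $(100101)$ of the basic construction. Since a block can hold values up to at least $3M$, sums over $M$ buckets never carry into a neighboring block.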
3.3 Scheme-I: Using Homomorphic Encryption
The above basic construction supports efficient similarity search without privacy. Based on it, we propose Scheme-I. In Scheme-I, each $\widehat{\mathrm{fid}}$ is encrypted using Paillier encryption, so that the server can still aggregate the $M$ intermediate IFVs on behalf of the client as in the basic construction,

3.3.1 Initialization and Index Construction
Gen$(1^\lambda)$: Additionally set up $(sk_p, pk_p)$ for the Paillier encryption. Include $(sk_p, pk_p)$ in $sk$.

Enc$(sk, \mathbf{f})$: Based on the basic index construction, additional steps are performed as follows: For each IFV stored in the


index $I[F_{k_1}(x_i)]$ ($i \in [Mm]$), use the Paillier cryptosystem to encrypt $\widehat{\mathrm{fid}}_i$ under the public key $pk_p$ according to the packing method presented above.

3.3.2 Search Over the Similarity Index
After locating the hit encrypted IFVs, the server homomorphically sums up these encrypted IFVs and returns the resulting ciphertext to the client. For each query object $q$, the server thus only needs to transfer one ciphertext corresponding to the sum of the plaintext values (hit IFVs). The client then decrypts the ciphertext and unpacks the blocks according to the alignment set by the above packing method, i.e., by interpreting the decryption result as a concatenation of many identifiers as bit-strings, and then separating them from it. In this way, the client effectively delegates the public-key computation burden to the server (i.e., the server integrates all encrypted IFVs into one while the client only needs to do one decryption).

Search$(sk, q; \gamma)$:
Client Side:
Identical to that in the basic construction.
Server Side:
Let $y_1 = F_{k_1}(x_1), \ldots, y_M = F_{k_1}(x_M)$. Extract the corresponding encrypted $\widehat{\mathrm{fid}}_i$ from $I[y_i]$ for each $i \in [M]$. Compute the intermediate encrypted result $\theta_q = \prod_{i=1}^{M} [\![\widehat{\mathrm{fid}}_i]\!]$ and return it to the client.
Client Side:
1) Decrypt $\theta_q$ using the private key $sk_p$.
2) Unpack the decrypted message to obtain $\mathbf{b} = (b_1, \ldots, b_{\#\mathbf{f}})$, where $b_i$ is the binary representation of a $\lceil \log_2 3M \rceil$-bit integer.
3) Retrieve the ciphertexts corresponding to $\mathbf{i} = \{i \in [\#\mathbf{f}] : b_i = M\}$ from the server.
Server Side:
Return $c_q = \{c_i : i \in \mathbf{i}\}$ to the client.

We remark that the request for the ciphertexts can instead be made by using a private information retrieval (PIR) protocol, which allows the transfer of ciphertexts without the server learning the locations $\mathbf{i}$.

3.3.3 Supporting Efficient Index Dynamics
Next, we show that efficient file updates can be realized by our encrypted similarity index.
Update$(sk, \mathbf{f}'; \gamma)$:
Client Side:
1) Procedures for adding the new file set $\mathbf{f}'$ are the same as in the basic construction.
2) To delete existing files $\mathbf{f}'$, for each file $f_i \in \mathbf{f}'$ initialize a 1024-bit 0 vector, then set the $\big((i \bmod \frac{1024}{\lceil \log_2 3M \rceil}) \cdot \lceil \log_2 3M \rceil - 1\big)$-th entry to 1, and generate the corresponding encrypted sub-index $I_{f_i}$ by homomorphic encryption. Finally, send $\tau_u = (I_{f'}, \mathbf{i})$ to the server, where $I_{f'} = \{I_{f_i} : f_i \in \mathbf{f}'\}$.
Server Side:
1) Procedures for adding the new file set $\mathbf{f}'$ are the same as in the basic construction.
2) For each file $f_i \in \mathbf{f}'$ to be deleted, for the $\frac{i \cdot \lceil \log_2 3M \rceil}{1024}$-th block in all the buckets $I[x]$, and $I_{f_i}$ extracted from $\tau_u$, compute $I[x] \leftarrow I[x] \cdot I_{f_i}$ (by leveraging the additively homomorphic property) and delete $c_i$ for all $i \in \mathbf{i}$, where $\mathbf{i} = \{i : f_i \in \mathbf{f}'\}$. This works since the decimal value of the entries corresponding to $f_i \in \mathbf{f}'$ in $I$ is changed from $M$ to $3M$ and the others are turned to $2M$.

If the last “blocks” of the original index $I$ do not encode the information of $\frac{1024}{\lceil \log_2 3M \rceil}$ files, the client selects part of the files from $\mathbf{f}'$ to construct the sub-index $I'_{f'}$ to merge into the last “blocks” of the index $I$. This is to make sure that the last “blocks” of the index $I$ are encoded with $\frac{1024}{\lceil \log_2 3M \rceil}$ files.

3.4 Scheme-II: Using Pseudorandom Padding
To further improve the computation efficiency, we next introduce Scheme-II by leveraging a “one-time-pad” encryption instantiated by a PRF. It is CPA-secure due to the pseudorandomness of the PRF, which in turn relies on its one-wayness [25]. To avoid the reiteration of the same steps, we only present the key differences and modifications.

3.4.1 Initialization and Index Construction
Gen$(1^\lambda)$: Besides the basic parameters such as $m$ and those for LSH, we need $k_1$, $k_2$, and an additional bit string $k_3 \xleftarrow{\$} \{0, 1\}^\lambda$ chosen uniformly at random to serve as the key of the PRF $G_{k_3}(\cdot)$. Output $sk = (k_1, k_2, k_3, m, h_1, h_2, \ldots, h_M)$.

Enc$(sk, \mathbf{f})$: Based on the basic similarity index construction, the following additional step is added.
3) For each IFV stored in the index $I[F_{k_1}(x_i)]$ ($i \in [Mm]$), encrypt $\widehat{\mathrm{fid}}_i$ as $I[F_{k_1}(x_i)] := \widehat{\mathrm{fid}}_i \oplus G_{k_3}(\mathrm{IndexID} \| F_{k_1}(x_i))$, where $\|$ denotes the concatenation of strings and $G(\cdot)$ is a PRF. Here, we use IndexID (see footnote 2) to uniquely identify the encrypted similarity index $I$.

Footnote 2: The notion of IndexID is reasonable. In practice, if the data owner wants to build indexes for different combinations of multimedia data files, then an IndexID can be used to uniquely identify each index.

3.4.2 Search Over the Similarity Index
For each query object, different from Scheme-I, the server needs to locate and return all hit encrypted IFVs (i.e., $M$ IFVs) to the client during the search process. At the client side, after receiving the encrypted results, the client uses the secret key to decrypt and then decodes the IDs of the files which are similar to the query object $q$.

Search$(sk, q; \gamma)$:
Client Side:
Identical to that in the basic construction.
Server Side:
Let $y_1 = F_{k_1}(x_1), \ldots, y_M = F_{k_1}(x_M)$. Extract the corresponding IFVs from $I[y_i]$ for each $i \in [M]$. Return $\mathbf{r} = (r_1 = I[y_1], \ldots, r_M = I[y_M])$ to the client.
Client Side:
1) Compute $G_{k_3}(\mathrm{IndexID} \| F_{k_1}(x_i))$ for $i \in [M]$.


2) Decrypt $\mathbf{r}$ using $G_{k_3}(\mathrm{IndexID} \| F_{k_1}(x_i))$ for $i \in [M]$ to obtain

$$d_i = r_i \oplus G_{k_3}(\mathrm{IndexID} \| F_{k_1}(x_i)) = I[y_i] \oplus G_{k_3}(\mathrm{IndexID} \| F_{k_1}(x_i)) = \widehat{\mathrm{fid}}_i.$$

3) Compute $d = \bigwedge_{i=1}^{M} \widehat{\mathrm{fid}}_i$.
4) Send $\mathbf{i} = \{i : d[i] = 1\}$ to the server, or use PIR.
Server Side:
Return $c_q = \{c_i : i \in \mathbf{i}\}$ to the client.

3.4.3 Supporting Efficient Index Dynamics
Update$(sk, \mathbf{f}'; \gamma)$:
Client Side:
Based on the basic similarity index construction, the following modifications should be made for addition.
1) For each row vector denoted by AddVec in $I'$, set $\mathrm{AddVec}[j] = \mathrm{AddVec}[j] \oplus G_{k_3}(\mathrm{IndexID} \| j)[\#\mathbf{f} + i]$, where $j \in [Mm]$.
2) Output the encrypted sub-index $I_{f'}$ and $\mathbf{c}'$ as $\tau_u$.
For deletion using $\tau_u$, the following adaptations are made.
1) For each file $f_i$ to be deleted, generate an $(Mm)$-dimensional row vector DelVec, where each entry is initialized to 0. Then encrypt it as $\mathrm{DelVec}[j] = \mathrm{DelVec}[j] \oplus G_{k_3}(\mathrm{IndexID} \| j)[i]$, where $j \in [Mm]$.
2) Output the encrypted sub-index $I_{f'}$ as $\tau_u$.
Server Side:
1) For file addition, append the sub-index $I_{f'}$ extracted from $\tau_u$ to $I$ and $\mathbf{c}'$ to $\mathbf{c}$, respectively.
2) For file deletion, to delete $f_{id} \in \mathbf{f}'$, set the $i$th row of $I$ to the encrypted DelVec of $f_{id}$ extracted from $I_{f'}$ and delete $\mathbf{c}'$ from $\mathbf{c}$ accordingly.
It is easy to see the correctness of the update.

3.5 Efficiency Analysis
For the search time efficiency, $M$ hash operations are used to generate the search token at the client side for both schemes. At the server side, for Scheme-I, $\frac{\#\mathbf{f} \cdot \lceil \log_2 (3M) \rceil \cdot (M-1)}{1024}$ additive homomorphic operations are applied to aggregate $M$ encrypted IFVs, while Scheme-II only needs to locate the buckets using the search token. For the client to decrypt the (intermediate) search results, $\frac{\#\mathbf{f} \cdot \lceil \log_2 (3M) \rceil}{1024}$ (Paillier) decryption operations, or $M$ XOR operations over bit vectors of length $\#\mathbf{f}$, are needed for Scheme-I and Scheme-II respectively. Besides, $\#\mathbf{f}$ checking operations are involved to evaluate the equality of each decrypted entry to 1 or $M$.

Although the searching time complexity is theoretically $O(\#\mathbf{f})$, we still achieve practical efficiency. For Scheme-II, the computations involved are XOR and equality checking, which are extremely efficient. For Scheme-I, due to the use of the packing technique, only a fraction of $\#\mathbf{f}$ ciphertexts is processed. Our experiment also confirmed that the involved computation, which is public homomorphic aggregation of ciphertexts at the server side and the decryption of aggregated ciphertexts at the client side, attains practical efficiency.

While the de facto standard of SSE efficiency is sub-linear in $\mathbf{f}$, those schemes only work for simple keyword search or a boolean expression based on keyword comparison. Conceptually, those schemes process a selected set of keywords instead of the files themselves. Since our aim is to support fuzzy search across many large files, it is practically infeasible to adopt this approach, which requires the preparation of every possible “keyword” across a large number of files and tagging them accordingly. One could alternatively consider verifying the similarities between each data file and the query object file. However, it either nullifies the advantages of cloud computing for the client (if the processing is done at the client side) or puts an extremely high burden on the cloud (for the cloud to go through each of the large files to check if it matches the query, in the encrypted domain). In our schemes, $O(\#\mathbf{f})$ computation is needed due to the functionality we aim to support: answering unforeseen queries and returning results based on similarity. Under this setting, a sub-optimal solution may miss some of the candidates. The elegance of our design is that, while still requiring $O(\#\mathbf{f})$ operations, we restrict them to the basic primitive operations as we previously discussed.

We leverage the cloud to do most of the computation jobs. Compared with the (trivial) solution of returning $O(\#\mathbf{f})$ “objects” (anything after post-processing of the cloud on encrypted data, say, via fully homomorphic encryption; or in the trivial case, the ciphertexts themselves) for the client to process, we obtain $\frac{\#\mathbf{f}}{\phi(d)}$ times of savings for communication and storage costs, where $\phi(d)$ denotes the size of the returned candidate set. These significant savings are also demonstrated numerically in Tables 2, 3, 4.

3.6 Security Analysis
For both of our constructions, we define the leakage functions as follows.

Definition 8. (Leakage function $\mathcal{L}_1$ for setup). Given a multimedia data collection $\mathbf{f}$, $\mathcal{L}_1(\mathbf{f}) = \{|f_i|\}_{i=1}^{\#\mathbf{f}}$, where $|\ast|$ denotes the length of the file.

Definition 9. (Leakage function $\mathcal{L}_2$ for search). Given a multimedia data collection $\mathbf{f}$ and a search object $q$, $\mathcal{L}_2(\mathbf{f}, q) = \{\pi(\mathbf{f}, q), A_p(\mathbf{f}, q), E(q)\}$, where $E(q)$ is the set of identifiers of the buckets corresponding to $q$, i.e., $E(q) = \{id(g_1(q)), \ldots, id(g_M(q))\}$.

Specifically, in the above definition, the access pattern implicitly includes intermediate results, e.g., the encrypted IFVs extracted from the hit buckets. We emphasize that the encrypted IFVs and their encrypted aggregation do not reveal additional information since these intermediate results are encrypted before the recovery of $id(c_q)$.

Definition 10. (Leakage function $\mathcal{L}_3$ for update). Given a multimedia data collection $\mathbf{f}$ and an update object $u$, $\mathcal{L}_3(\mathbf{f}, u) = \{id(f_i), |f_i|\}_{i=\#\mathbf{f}+1}^{\#\mathbf{f}+\#\mathbf{f}'}$ for add updates, and $\mathcal{L}_3(\mathbf{f}, u) = \{id(f_i)\}_{i \in \#\mathbf{f}'}$ for delete updates.

Theorem 1. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is CQA2-secure in the standard model.
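The operation counts quoted in the efficiency analysis of Section 3.5 can be tabulated with a small helper. The sketch below follows the formulas in the text, rounds the number of 1024-bit plaintexts up to an integer (an assumption; the text states the unrounded fraction), and uses hypothetical function and key names.

```python
def search_costs(n_files, M, word_bits=1024):
    """Per-query operation counts for Scheme-I and Scheme-II."""
    b = (3 * M - 1).bit_length()              # ceil(log2(3M)) bits per block
    words = -(-n_files * b // word_bits)      # ceil(n_files * b / word_bits)
    return {
        "token_hashes": M,                    # client, both schemes
        "scheme1_server_hom_adds": words * (M - 1),
        "scheme1_client_decryptions": words,
        "scheme2_client_xors": M,             # each over a bit vector of length n_files
        "equality_checks": n_files,           # client, both schemes
    }

# Example with illustrative parameters (not taken from the experiments).
print(search_costs(n_files=1024, M=5))
```

With 1024 files and $M = 5$, each block takes 4 bits, so one packed IFV fits into 4 plaintexts; the server then performs 16 homomorphic additions and the client 4 decryptions.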


Proof. The only difference between Scheme-I and Scheme-II lies in how the IFVs are encrypted. We only use the fact that both methods are CPA-secure in the following proof, without utilizing the homomorphic property. Thus, the proofs for both schemes are identical.

For either scheme, we argue its CQA2-security by describing a polynomial-time simulator $\mathcal{S}$, such that for any PPT adversary $\mathcal{A}$, the outputs of $\mathbf{Real}^{SSE}_{\mathcal{A}}(\lambda)$ and $\mathbf{Ideal}^{SSE}_{\mathcal{A},\mathcal{S}}(\lambda)$ are indistinguishable, as follows:

[Setup] $\mathcal{S}$ is given $\mathcal{L}_1(\mathbf{f}) = \{|f_i|\}_{i=1}^{\#\mathbf{f}}$. Let $I$ be a dictionary of size $Mm$ mapping $\lambda$-bit random strings to encryptions of zeros by the underlying encryption scheme. The number of zeros encrypted is determined by $\#\mathbf{f}$ (and the packing method described in the construction for Scheme-I). Generate $k_2' \leftarrow \mathrm{SKE.Gen}(1^\lambda)$ and $c_i \leftarrow \mathrm{SKE.Enc}_{k_2'}(\{0, 1\}^{|f_i|})$ for $i \in [\#\mathbf{f}]$. Output $\gamma = (I, \mathbf{c})$, where $\mathbf{c} = \{c_1, \ldots, c_{\#\mathbf{f}}\}$. $\mathcal{S}$ maintains internally a mapping from the $Mm$ identifiers of the buckets to the $\lambda$-bit random strings generated above.

[Simulating Queries] $\mathcal{S}$ is given $\mathcal{L}_2(\mathbf{f}, q) = \{\pi(\mathbf{f}, q), A_p(\mathbf{f}, q), E(q)\}$. Consider the $M$ identifiers of buckets corresponding to the query object $q$. $\mathcal{S}$ sends the random strings corresponding to these identifiers to $\mathcal{A}$. Upon verifying the response from $\mathcal{A}$ against the index $I$, $\mathcal{S}$ sends the file identifiers given in $A_p(\mathbf{f}, q)$ to $\mathcal{A}$, who then returns the corresponding ciphertexts.

[Simulating Updates] For add updates, $\mathcal{S}$ is given $\mathcal{L}_3(\mathbf{f}, u) = \{id(f_i), |f_i|\}_{i=\#\mathbf{f}+1}^{\#\mathbf{f}+\#\mathbf{f}'}$, which are simulated as in the setup phase. For delete updates, $\mathcal{S}$ is given $\mathcal{L}_3(\mathbf{f}, u) = \{id(f_i)\}_{i \in \#\mathbf{f}'}$. $\mathcal{S}$ simulates a fresh $I_{f'}$ as in the setup phase, and sends $(I_{f'}, \{id(f_i)\}_{i \in \#\mathbf{f}'})$ to $\mathcal{A}$.

The indistinguishability between $\mathbf{Real}^{Scheme\text{-}I}_{\mathcal{A}}(\lambda)$ and $\mathbf{Ideal}^{Scheme\text{-}I}_{\mathcal{A},\mathcal{S}}(\lambda)$ follows from the CPA security of SKE or the XOR-based encryption scheme, and the indistinguishability of the PRF from random functions.

It is easy to note that the simulation above is quite simple. Specifically, our schemes only return an aggregated encrypted IFV (for Scheme-I) or a number of encrypted IFVs (for Scheme-II), instead of directly retrieving the encrypted results for the client. That is why the simulator can just return a (set of) random string(s) of appropriate length, since the adversary simply cannot distinguish without the corresponding secret key for decryption. Also, in retrieving the encrypted results for the client in the final stage, the simulator does not need to work directly with the encrypted index. Instead, the information available from the leakage function already allows it to do that. From another perspective, the adaptive security of our scheme can be achieved in the standard model due to the non-trivial communication overhead, i.e., the encrypted IFV is of length $O(\#\mathbf{f})$.

For forward privacy, observe that adding new files to the database introduces new columns in encrypted form, and that the decryption of IFVs is performed at the client side during queries. Thus, even if equipped with previous queries, the server is unable to decrypt the newly added portions of the IFVs.

Theorem 2. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is forward private in the standard model.

Proof. By the same argument as in the proof of Theorem 1, we can assume that the encryption schemes in both Scheme-I and Scheme-II are CPA-secure. In the forward privacy game, upon receiving the challenge updates $u_0$ and $u_1$, the challenger simply encrypts zeros to produce dummy ciphertexts $\mathbf{c}'$ and a dummy index $I_{f'}$. If the adversary can distinguish the dummy update from the real ones, we can construct an adversary which breaks the CPA-security of the encryption scheme, which happens with probability less than $\mathrm{negl}(\lambda)$. Suppose that the adversary cannot distinguish the above. The simulated ciphertexts and index contain no information about $u_0$ and $u_1$, so the success probability of the adversary is exactly $\frac{1}{2}$. Thus, the overall success probability of the adversary is bounded above by $\frac{1}{2} + \mathrm{negl}(\lambda)$.

4 THEORETICAL ANALYSIS OF SEARCH QUALITY
For the search quality of our system, the computation of the union of IFVs in consecutive neighboring buckets can greatly reduce the false negative rate, and the computation of the bitwise intersection of IFVs in multiple hashing-hit buckets during search can eliminate a large amount of false positives.

Based on Definition 7, each LSH $h_i(q) = \lfloor \frac{a \cdot q + b}{W} \rfloor$ maps the $d$-dimensional points $q$ onto intervals of length $W$. After mapping all feature vectors $p$ extracted from $\mathbf{f}$, the probability that a query $q$ falls into the same bucket as some point $p$ depends on the interval length $W$ and the distance between $q$ and $p$ (i.e., $\| p - q \|_2$).

Fig. 2 shows the probability that $q$ falls into the same bucket as $p$ and the probability that $q$ falls into the left or right neighboring bucket of $p$. Consider an LSH $h_i(\cdot)$. For a fixed $p$, let $f_p(q)$ be the probability that $q$ and $p$ fall into the same interval (i.e., collide with each other under $h_{a,b}(\cdot)$). In addition, we use $f_p^-(q)$ and $f_p^+(q)$ to denote the probability that $q$ falls into the neighboring left and right intervals, respectively. For ease of analysis, we let $x(\delta)$ ($\delta \in \{-1, 1\}$) be the absolute distance of $p$ from the boundary of the interval $h_{a,b}(p) + \delta$. According to the definition $h_{a,b}(q) = \lfloor \frac{a \cdot q + b}{W} \rfloor$, the difference between the projections of $q$ and $p$ onto the line is $(a \cdot q + b) - (a \cdot p + b) = a \cdot (q - p)$, distributed as $\| p - q \|_2 X$, where $X$ follows a Gaussian distribution. When $W$ is large enough, $q$ can fall into the same bucket as $p$, or its left/right neighboring slots. Formally:

Lemma 3. The probability that two items $p$ and $q$ collide for an LSH is

$$f_p(q) = p(u) = \Pr[h_{a,b}(q) = h_{a,b}(p)] = \int_0^W \frac{1}{u}\, g\!\left(\frac{t}{u}\right)\left(1 - \frac{t}{W}\right) dt,$$

where $p(u)$ is the probability as a function of $u$, $u = \| p - q \|_2$, and $g(t)$ is the probability density function of the absolute value of the 2-stable distribution, namely, $g(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$.

Proof. Set $t = |(a \cdot p + b) - (a \cdot q + b)| = |a \cdot (p - q)|$ and $x = \frac{t}{u}$. Therefore, $x$ is distributed as $X$. As shown in Fig. 3, the probability that $\frac{a \cdot q + b}{u}$ falls into the same slot as $\frac{a \cdot p + b}{u}$ is $\int_0^{W/u} g(x) \left( \frac{W/u - x}{W/u} \right) dx$. According to the substitution


[Fig. 2: Probability of query q and its approximate neighbors falling into different intervals.]
[Fig. 3: Probability of query q and its approximate neighbors falling into the same intervals.]

rule for the definite integral $\int_a^b f(x)\,dx = \int_\alpha^\beta f(\psi(t))\,\psi'(t)\,dt$, with $\psi(t) = t/u$, we have

$$\int_0^{W/u} g(x) \left( \frac{W/u - x}{W/u} \right) dx = \int_0^{W} g(\psi(t)) \left( \frac{W/u - \psi(t)}{W/u} \right) d\psi(t) = \int_0^{W} g\!\left(\frac{t}{u}\right) \cdot \left( \frac{W/u - t/u}{W/u} \right) \cdot \frac{1}{u}\, dt = \int_0^{W} \frac{1}{u}\, g\!\left(\frac{t}{u}\right)\left(1 - \frac{t}{W}\right) dt.$$

Because the variance of $uX$ is $u^2$, the increase of $u$ will lead to the decrease of collision probabilities. According to Definition 7, we have $P_1 = p(1)$ and $P_2 = p(c)$ for $R = 1$ (readers can refer to [27] for more details). Further, we have

$$p(1) = \int_0^W g(t)\left(1 - \frac{t}{W}\right) dt = \int_0^W g(t)\,dt - \int_0^W g(t)\,\frac{t}{W}\,dt = \int_0^W g(t)\,dt + \frac{1}{W\sqrt{2\pi}}\left(e^{-\frac{W^2}{2}} - 1\right) = \mathrm{normcdf}(W) - \frac{1}{2} + \frac{1}{W\sqrt{2\pi}}\left(e^{-\frac{W^2}{2}} - 1\right),$$

$$p(c) = \int_0^W \frac{1}{c}\, g\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{W}\right) dt = \int_0^W \frac{1}{c}\, g\!\left(\frac{t}{c}\right) dt + \frac{c}{W\sqrt{2\pi}}\left(e^{-\frac{W^2}{2c^2}} - 1\right) = \mathrm{normcdf}(W/c) - \frac{1}{2} + \frac{c}{W\sqrt{2\pi}}\left(e^{-\frac{W^2}{2c^2}} - 1\right),$$

where $\mathrm{normcdf}(\cdot)$ denotes the distribution function of the 2-stable distribution.

where $S$ denotes the collection of the approximate members of $q$ in the dataset, $p_i \in S$, $u \le R$ in Eq. (1a), and $p_i \in \mathbf{f} \setminus S$, $u > R$ in Eq. (1b).

Proof. The probability that item $q$ falls into the slot to the right (left) of $p$ depends on how close $q$ is to the right (left) boundary of its slot. We have

$$f_p^-(q) = \Pr[h_{a,b}(q) = h_{a,b}(p) - 1] = \int_{x(-1)}^{x(-1)+W} \frac{1}{u}\, g\!\left(\frac{t}{u}\right) dt \approx e^{-A\,x(-1)^2},$$

$$f_p^+(q) = \Pr[h_{a,b}(q) = h_{a,b}(p) + 1] = \int_{x(+1)}^{x(+1)+W} \frac{1}{u}\, g\!\left(\frac{t}{u}\right) dt \approx e^{-A\,x(+1)^2}.$$

The probability density function of a Gaussian random variable is $e^{-x^2/2\sigma^2}$ (scaled by a normalizing constant). Thus, according to [18], the probability that point $q$ falls into the neighboring slot of $p$ can be estimated by the equations above, where $A$ is a constant depending on $\| p - q \|_2$ and $x(+1) + x(-1) = W$. Given an $(R, cR, P_1, P_2)$-sensitive LSH family, we have $f_p^-(q)$ and $f_p^+(q)$. In our similarity search index, probing one bucket is equivalent to probing three neighboring buckets at a time. Suppose the set of approximate members of the query $q$ is $S$. After hashing and randomization, the probability of mapping $q$ into bucket $k$ (which contains IFVs of the approximate members of $q$) is $(f_{p_i}^- + f_{p_i} + f_{p_i}^+)$ for each $p_i \in S$. With $M$ LSH functions, the probability that $q$ is successfully mapped into all $M$ target buckets is $f_{success_i} = \prod_{j=1}^{M} (f_{p_{ij}}^- + f_{p_{ij}} + f_{p_{ij}}^+)$. Thus, for each member of $S$ the probability of a false negative, caused by at least one bucket miss, is given by Eq. (1a).

For item $i$ in the set $\mathbf{f} \setminus S$, the probability that $q$ collides with it for the LSH function $j$ is $(f_{p_{ij}}^- + f_{p_{ij}} + f_{p_{ij}}^+)$. Thus, if $p_i$ is not an approximate neighbor of $q$, the false positive probability is the probability that the query collides with it under all $M$ hash functions, as in Eq. (1b).

After computing $f_{neg_i}$ for each $p_i \in S$, the expected number of correctly returned files is $\sum_{i=1}^{\#S} (1 - f_{neg_i})$. We can estimate the recall of our schemes as in Eq. (1c).

5 EXPERIMENT EVALUATION
We evaluate the efficiency and effectiveness of our constructions by thorough experiments. The experiments are performed on a machine with an Intel Core i5-4460S 2.90GHz processor, 12 GB of DRAM and a 500GB 5400RPM SATA disk. We report space cost, search time, recall, and error ratio.
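As a numerical sanity check on the closed forms for $p(1)$ and $p(c)$ given with Lemma 3, the sketch below evaluates $p(u) = \int_0^W \frac{1}{u} g(t/u)(1 - t/W)\,dt$ by the midpoint rule and compares it with $\mathrm{normcdf}(W/u) - \frac{1}{2} + \frac{u}{W\sqrt{2\pi}}(e^{-W^2/(2u^2)} - 1)$. It uses $g$ exactly as written in Lemma 3; the function names, the step count, and the choice $W = 4$ are illustrative assumptions.

```python
import math

def g(t):
    """Density used in Lemma 3: exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def normcdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_closed(u, W):
    """Closed form of p(u), generalizing the expressions for p(1) and p(c)."""
    tail = math.exp(-W * W / (2.0 * u * u)) - 1.0
    return normcdf(W / u) - 0.5 + u / (W * math.sqrt(2.0 * math.pi)) * tail

def p_numeric(u, W, steps=20000):
    """Midpoint-rule approximation of the integral defining p(u)."""
    h = W / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += g(t / u) / u * (1.0 - t / W)
    return total * h

W = 4.0
for u in (1.0, 2.0, 4.0):
    print(u, p_closed(u, W), p_numeric(u, W))
```

The two columns agree to within the integration error, and the values decrease as $u$ grows, consistent with the remark that a larger distance $u$ lowers the collision probability.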
Theorem 4. Given an $(R, cR, P_1, P_2)$-sensitive LSH family for a distance function $\| \ast \|_2$, the false negative rate, the false positive rate and the recall (defined in Section 5.2) of the encrypted similarity index are

$$f_{neg_i} = 1 - \prod_{j=1}^{M} \left(f_{p_{ij}}^- + f_{p_{ij}} + f_{p_{ij}}^+\right), \quad (1a)$$

$$f_{pos_i} = \prod_{j=1}^{M} \left(f_{p_{ij}}^- + f_{p_{ij}} + f_{p_{ij}}^+\right), \quad (1b)$$

$$\mathrm{recall} = \frac{1}{\#S} \sum_{i=1}^{\#S} \left(1 - f_{neg_i}\right), \quad (1c)$$

5.1 Evaluation Datasets
We choose the following three representative datasets summarized in Table 1, showing that our encrypted index for different dataset sizes can easily fit into the main memory.

5.1.1 Forest Covertype Trace
The actual forest cover type for a given observation (30 × 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data


Dataset type       # Files    Feature dimension   Dataset size
Forest Covertype   581,012    54                  119.9 MB
Image Data I       59,500     400                 22.7 MB
Image Data II      14,950     512                 27 GB

TABLE 1: Characteristics of experimental datasets

is in raw form (not scaled) and contains binary (0 or 1) columns of data for independent qualitative variables (wilderness areas and soil types). This data set³ contains 581,012 data points. Each has 54-dimensional attributes, which include 10 quantitative variables, 4 binary wilderness areas, and 40 binary soil type variables. It is commonly used to predict forest cover type from cartographic variables only.

5.1.2 Image Data I
The image data set is obtained from a project of Shakhnarovich⁴. It contains 59,500 grey-level image patches of 20 × 20 pixels taken from a collection of motorcycle images. We reshaped each image into a 400-dimensional column vector.

5.1.3 Image Data II
The image data set consists of 14,950 high quality pictures and the total size is about 27GB. For each image, we use the color histogram extraction tool from FIRE⁵ to extract a 512-dimensional color histogram.

3. https://archive.ics.uci.edu/ml/datasets/Covertype
4. http://ttic.uchicago.edu/~gregory/download.html
5. Flexible image retrieval engine http://thomas.deselaers.de/fire

5.2 Evaluation Metrics
Our experiment randomly generates queries and extracts the corresponding feature vectors as the search objects for each dataset. All measures are averaged over these queries.
Recall that our goal is to generate a candidate set of approximate files for refinement such as privacy-preserving approximate nearest neighbors (ANN) search or k-nearest neighbor (kNN) search. We not only performed experiments on different types of data, but also evaluated the efficiency and search quality under different constraints (i.e., distance threshold R for ANN and parameter k for kNN). For each search feature vector in the Forest Trace dataset and image dataset I, the ideal answer is defined to be the most similar files whose feature vectors are the nearest neighbors of the search one, under a pre-defined distance constraint R using the Euclidean distance measure. For the image dataset II, the ideal answer is defined to be the k nearest neighbors of the query object. More specifically, we measure the performance of our encrypted similarity search scheme in four aspects: space cost, search time, recall and error ratio.
Space cost evaluates the space efficiency of the index data structure. The space cost of the index should be practical compared to the original dataset size.
Search time measures the time taken to answer one search query over the encrypted similarity index. It includes the time for locating the hit bucket positions in the index, decrypting the returned encrypted IFVs, and computing the intersection of IFVs.
Recall is a typical metric in the evaluation of similarity search over plaintext data. Given a search feature vector q, we use I(q) to denote the set of ideal answers. Assuming A(q) is the set of actual answers returned, recall is defined by

$$\mathrm{Recall} = \frac{|A(q) \cap I(q)|}{|I(q)|}.$$

Our goal is to generate a fine candidate set which will be used in the implementation of privacy-preserving kNN or ANN searches to the query object, where only the top-k candidates or the candidates under a pre-defined distance constraint R (i.e., the ideal answer I(q)) will be returned. So we do not directly consider the precision metric as in the plaintext similarity search domain [18], [28], [29].
Error ratio is commonly used to measure the quality of approximate nearest neighbor search. It evaluates how close the distances of the most similar files found by our index scheme are to the distances of the truly most similar files (computed using the corresponding feature vectors). For a set of queries Q, error ratio is defined by

$$\text{error ratio} = \frac{1}{|Q| N} \sum_{q \in Q} \sum_{k=1}^{N} \frac{d_{LSH_k}}{d^{*}_{k}},$$

where N = |A(q) ∩ I(q)|, $d_{LSH_k}$ is the distance of the k-th most similar file found by our scheme, and $d^{*}_{k}$ is the distance of the true k-th most similar file. Note that all distances are computed using feature vectors corresponding to the multimedia data file. In the ideal case, the values of recall and error ratio are both 1.0.

5.3 Experimental Results
Tables 2, 3 and 4 show the main results. For each dataset, the tables report the search time, the bucket length, and the space cost when achieving different recall values, the error ratios, and the ratios of the original dataset size to the candidate set size, #f/φ(d), where φ(d) denotes the size of the returned candidate set.
As shown in our results, for about 0.6 million items in the Forest Trace dataset, the search times of Scheme-I and Scheme-II are 49.94s and 0.1561s, respectively. For the actual communication cost, the returned intermediate encrypted results are also very small compared to the size of the original data files. As illustrated in Table 5, the communication costs of both schemes are practically acceptable.
We have demonstrated that our schemes achieve a high recall (i.e., the fraction of relevant instances that are retrieved). Note also that, as we will show, our system achieves a speedup of tens to hundreds of times (i.e., #f/φ(d)) compared with the naïve approach of scanning the whole database. The client thus gets a highly relevant candidate set which is much smaller, on which to perform a refinement such as kNN or ANN search. This is why we do not explicitly consider precision, which is the fraction of retrieved instances that are relevant.
The results show that our encrypted similarity index is highly space-efficient and time-efficient. In all cases, the number of hash tables required is only 5 and each hash table has a small number of buckets m. Even with a very large data size, the hash tables only occupy a small portion of the main memory size (without incurring a disk I/O).
Based on the theoretical analysis of search quality in Section 4, Fig. 5(a) and Fig. 5(b) show the false positive rates

Recall  Error   Search Time      Search Time       M  m    W    Space Cost        Space Cost         #f/φ(d)
        Ratio   of Scheme-I (s)  of Scheme-II (s)               of Scheme-I (MB)  of Scheme-II (MB)
0.9878  1.0006  49.94            0.1561            5  37   300  102.53            12.81              112
0.9527  1.0036  49.94            0.1561            5  49   240  135.78            16.97              236
0.9132  1.0086  49.94            0.1561            5  53   210  146.88            18.35              373

TABLE 2: Search performance of Forest Trace dataset with R = 150


Recall  Error   Search Time      Search Time       M  m    W    Space Cost        Space Cost         #f/φ(d)
        Ratio   of Scheme-I (s)  of Scheme-II (s)               of Scheme-I (MB)  of Scheme-II (MB)
0.9915  1.0008  5.12             0.0164            5  16   800  4.54              0.57               16
0.9418  1.0041  5.12             0.0164            5  24   550  6.81              0.85               43
0.9141  1.0075  5.12             0.0164            5  26   500  7.38              0.92               54

TABLE 3: Search performance of Image dataset I with R = 500


Recall  Error   Search Time      Search Time       M  m    W    Space Cost        Space Cost         #f/φ(d)
        Ratio   of Scheme-I (s)  of Scheme-II (s)               of Scheme-I (MB)  of Scheme-II (MB)
0.9942  1.0042  1.29             0.0111            5  65   10   4.61              0.58               406
0.9494  1.0073  1.29             0.0111            5  81   6    5.83              0.72               719
0.9118  1.0286  1.29             0.0111            5  112  5    8.06              1.00               940

TABLE 4: Search performance of image dataset II with k = 10
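
The Recall and Error Ratio columns of Tables 2, 3 and 4 follow the definitions in Section 5.2. As a minimal illustration only, the sketch below computes both metrics for a single query over plaintext feature vectors. The function names, the toy data, and the choice to pass N in explicitly are assumptions made for this example; it is not the authors' evaluation code.

```python
import math


def recall(actual_ids, ideal_ids):
    """Recall = |A(q) ∩ I(q)| / |I(q)| for one query."""
    ideal = set(ideal_ids)
    if not ideal:
        raise ValueError("the ideal answer set I(q) must be non-empty")
    return len(set(actual_ids) & ideal) / len(ideal)


def error_ratio_single(query, candidates, collection, n):
    """Average of d_LSH_k / d*_k over k = 1..n for one query.

    candidates: feature vectors of the files returned by the index.
    collection: feature vectors of all files (used for the true neighbours).
    Assumption: n <= len(candidates) and every true distance is non-zero.
    """
    found = sorted(math.dist(query, v) for v in candidates)[:n]
    true = sorted(math.dist(query, v) for v in collection)[:n]
    if len(found) < n or any(t == 0 for t in true):
        raise ValueError("need n candidates and non-zero true distances")
    return sum(f / t for f, t in zip(found, true)) / n


# Toy data (hypothetical): four files with 2-dimensional feature vectors.
collection = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]
query = (0.0, 0.0)
ideal_ids = [0, 1]    # files within distance R = 2.5 of the query
actual_ids = [0, 2]   # files returned by a hypothetical index
candidates = [collection[i] for i in actual_ids]

toy_recall = recall(actual_ids, ideal_ids)
toy_error_ratio = error_ratio_single(query, candidates, collection, n=2)
```

Averaging the per-query values over all queries in Q gives figures comparable in kind to those in the tables. An error ratio close to 1.0 means that the returned files are nearly as close to the query as the true nearest neighbours.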

[Fig. 4 plots: recall (vertical axis) versus bucket length W (horizontal axis) for M = 5, 10 and 15. Panels: (a) Forest Trace dataset, (b) Image dataset I, (c) Image dataset II.]

Fig. 4: Search quality for different W (R = 150 for Forest Trace dataset, R = 500 for image dataset I, k = 10 for image
dataset II)

(a) False positive (W = 250) (b) False positive (M = 5)

Fig. 5: False positive rate

with the change of M and W, respectively, while the other parameters are fixed, i.e., c = 5 and R = 1.
Given a fixed dataset, our search time is only related to the number of hash tables. Even for a large-scale dataset, desirable recall and error ratio can be achieved with high search speed. With the increase of W, better search quality can be obtained with fewer buckets in each hash table. Fig. 4 explicitly shows the relationship between search quality and bucket length W. As we discussed before, W should be carefully chosen to guarantee good search quality while maintaining a lower false positive rate.
When the false positive rate increases, we can use more hash tables to get more accurate results. As shown in Fig. 6, using more hash functions will reduce the number of falsely-returned files and the recall will only slightly decrease. It is also worth noting that how the recall changes (with the increase of M) highly depends on the characteristics of the dataset. Thus, for a specific dataset, by choosing W and M carefully we can obtain a balance between false positive rate and search quality. From a practical view, we can rely on a larger M to achieve a lower false positive rate while still maintaining the search quality at a high level.
To understand Fig. 7, we first describe the definitions related to the file return probability FRP. We can compute the distance r between the feature vector of a multimedia file and the query feature vector. Then, given a distance range to the query, FRP denotes the ratio of the number of files (among the returned candidate set) in the distance range to the number of files (among the whole file collection) in the distance range.
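
The file return probability can be illustrated with a short sketch. The code below is only an example under stated assumptions: the function name, the half-open range [lo, hi), and the toy data are hypothetical, and distances are computed on plaintext feature vectors with the Euclidean measure.

```python
import math


def file_return_probability(query, collection, returned_ids, lo, hi):
    """FRP for the distance range [lo, hi).

    Numerator: returned files whose distance r to the query is in the range.
    Denominator: all files in the collection whose distance is in the range.
    Returns None when no file of the collection falls in the range.
    """
    in_range = [i for i, v in enumerate(collection)
                if lo <= math.dist(query, v) < hi]
    if not in_range:
        return None
    returned = set(returned_ids)
    hits = sum(1 for i in in_range if i in returned)
    return hits / len(in_range)


# Toy data (hypothetical): distances to the query are 1, 2, 3 and 4.
collection = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]
query = (0.0, 0.0)
returned_ids = [0, 1, 2]

frp_near = file_return_probability(query, collection, returned_ids, 0.0, 2.5)
frp_far = file_return_probability(query, collection, returned_ids, 2.5, 5.0)
frp_empty = file_return_probability(query, collection, returned_ids, 10.0, 20.0)
```

In this toy case both nearby files are returned, so the FRP of the near range is higher than that of the far range, where only one of two files is returned.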


                        M = 5                          M = 10                         M = 15
                        Scheme-I (KB)  Scheme-II (KB)  Scheme-I (KB)  Scheme-II (KB)  Scheme-I (KB)  Scheme-II (KB)
Forest Covertype Trace  567.39         709.24          712.02         1418.48         854.43         2127.72
Image Data I            58.11          72.63           72.91          145.26          87.5           217.89
Image Data II           14.60          18.25           18.32          36.50           21.99          54.75

TABLE 5: The communication costs of Scheme-I and Scheme-II are #f · ⌈log2 3M⌉ bits and #f · M bits, respectively.

(a) Forest Trace dataset (b) Image dataset I (c) Image dataset II

Fig. 7: Recall of the files with different distance to the query (M = 5, R = 150 and W = 300 for Forest Trace dataset;
M = 5, R = 500 and W = 800 for image dataset I; M = 5, k = 10 (R ≈ 3.3277), and W = 10 for image dataset II)

[Fig. 6 plots: the number of returned files and the recall (vertical axes) versus the number of hash tables M (horizontal axis). Panels: (a) and (b) Forest Trace dataset (W = 300); (c) and (d) Image dataset I (W = 800); (e) and (f) Image dataset II (W = 10).]

Fig. 6: Influence of the number of hash tables

For a fixed search query, W, and M, the closer the feature vectors of files are to the search query, the higher the probability that those files will be returned. Our scheme returns almost all the closest files (r < 100 for the forest dataset, r < 300 for the image dataset I and r < 5 for the image dataset II). The results are consistent with Definition 7: the larger c is, the less probable it is that dissimilar files are returned.
Table 5 gives the communication costs of running Scheme-I and Scheme-II with different datasets. The experimental results show that while the communication cost of Scheme-II is slightly larger than that of Scheme-I, the communication efficiency of both schemes is practically high. For search efficiency, the search time in Scheme-II is two orders of magnitude smaller than that in Scheme-I.

6 CONCLUSION
We investigated the problem of privacy-preserving similarity search over encrypted feature-rich data. We proposed a high-speed and compact similarity search index supporting efficient file and index updates. Based on well-defined security models with leakages, we proved our index constructions are semantically secure against adaptively chosen query attack. Theoretical performance analysis was also presented to carefully characterize our index designs. Utilizing three different representative real-world datasets, we showed that our index constructions are highly compact, practically secure and enable efficient similarity search over encrypted multimedia data with high search quality.

7 ACKNOWLEDGMENTS
Qian’s research is supported in part by National Natural Science Foundation of China under Grant No. 61373167 and National Basic Research Program of China (973 Program) under Grant No. 2014CB340600. Sherman Chow is supported by General Research Fund Grant No. 14201914 and the Early Career Award from Research Grants Council, Hong Kong.


REFERENCES

[1] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in S&P ’00, 2000, pp. 44–55.
[2] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchable symmetric encryption,” in CCS ’12, 2012, pp. 965–976.
[3] R. Curtmola, J. A. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: Improved definitions and efficient constructions,” in CCS ’06. ACM, 2006, pp. 79–88.
[4] D. Cash, S. Jarecki, C. S. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Highly-scalable searchable symmetric encryption with support for boolean queries,” in CRYPTO, 2013, pp. 353–373.
[5] F. Hahn and F. Kerschbaum, “Searchable encryption with secure and efficient updates,” in CCS ’14. ACM, 2014, pp. 310–320.
[6] S. Kamara and C. Papamanthou, “Parallel and dynamic searchable symmetric encryption,” in Financial Crypt. ’13, 2013, pp. 258–274.
[7] E. Stefanov, C. Papamanthou, and E. Shi, “Practical dynamic searchable encryption with small leakage,” in NDSS ’14, 2014.
[8] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keyword search over encrypted data,” in ACNS ’04, 2004, pp. 31–45.
[9] D. Boneh and B. Waters, “Conjunctive, subset, and range queries on encrypted data,” in TCC ’07, 2007, pp. 535–554.
[10] T. Moataz and A. Shikfa, “Boolean symmetric searchable encryption,” in ASIACCS ’13. ACM, 2013, pp. 265–276.
[11] M. Chase and S. Kamara, “Structured encryption and controlled disclosure,” in ASIACRYPT, 2010, pp. 577–594.
[12] R. W. F. Lai and S. S. M. Chow, “Structured encryption with non-interactive updates and parallel traversal,” in ICDCS, 2015, pp. 776–777.
[13] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in VLDB ’98, 1998, pp. 194–205.
[14] Q. Lv, M. Charikar, and K. Li, “Image similarity search with compact data structures,” in CIKM ’04. ACM, 2004, pp. 208–217.
[15] I. Giangreco, I. Al Kabary, and H. Schuldt, “ADAM - database and information retrieval system for big multimedia collections,” in Intl. Congress on Big Data, 2014.
[16] M. Kuzu, M. S. Islam, and M. Kantarcioglu, “Efficient similarity search over encrypted data,” in ICDE ’12, 2012, pp. 1156–1167.
[17] G. R. Hjaltason and H. Samet, “Properties of embedding methods for similarity searching in metric spaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 530–549, 2003.
[18] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: Efficient indexing for high-dimensional similarity search,” in VLDB ’07, 2007, pp. 950–961.
[19] J. Yuan and S. Yu, “Efficient privacy-preserving biometric identification in cloud computing,” in INFOCOM, 2013, pp. 2652–2660.
[20] T. Moataz, A. Shikfa, N. Cuppens-Boulahia, and F. Cuppens, “Semantic search over encrypted data,” in ICT ’13, 2013, pp. 1–5.
[21] B. Wang, S. Yu, W. Lou, and Y. T. Hou, “Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud,” in IEEE INFOCOM ’14, 2014.
[22] W. K. Wong, D. W.-L. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in SIGMOD ’09. ACM, 2009, pp. 139–152.
[23] T. Deselaers, D. Keysers, and H. Ney, “Features for image retrieval: an experimental comparison,” Information Retrieval, vol. 11, no. 2, pp. 77–107, 2008.
[24] S. Har-Peled, P. Indyk, and R. Motwani, “Approximate nearest neighbor: Towards removing the curse of dimensionality,” Theory of Computing, vol. 8, no. 1, pp. 321–350, 2012.
[25] J. Katz and Y. Lindell, Introduction to modern cryptography: principles and protocols. CRC Press, 2007.
[26] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in EUROCRYPT ’99, 1999, pp. 223–238.
[27] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in SCG ’04. ACM, 2004, pp. 253–262.
[28] A. Joly and O. Buisson, “A posteriori multi-probe locality sensitive hashing,” in MM ’08. ACM, 2008, pp. 209–218.
[29] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in ICCV ’09. IEEE, 2009, pp. 2130–2137.

Qian Wang received the B.S. degree from Wuhan University, China, in 2003, the M.S. degree from Shanghai Institute of Microsystem and Information Technology, CAS, China, in 2006, and the Ph.D. degree from Illinois Institute of Technology, USA, in 2012, all in Electrical Engineering. He is a Professor with the School of Computer Science, Wuhan University. His research interests include wireless network security and privacy, cloud computing security, and applied cryptography.

Meiqi He received the B.S. degree in Information Security from Wuhan University, China, in 2015. She is working towards the Ph.D. degree in the Computer Science Department at The University of Hong Kong. Her research interests include cloud computing, information security, applied cryptography and bioinformatics.

Minxin Du received the B.S. degree in Computer Science and Technology from Wuhan University, China, in 2015. He is working towards the Master degree in the Computer School in Wuhan University. His research interests include cloud computing, information security, and applied cryptography.

Sherman S.M. Chow is an assistant professor in the Chinese University of Hong Kong. He was a research fellow at University of Waterloo, after receiving his Ph.D. degree from New York University. His main interest is in Applied Cryptography. He has published in CCS, EuroCrypt, ITCS, and NDSS. He served on the program committees of AsiaCrypt, CCS, ESORICS, ICDCS, Infocom, PKC, and SACMAT. He is a program co-chair of AsiaCCS-SCC 15, ISC 14, and ProvSec 14, and an editor of IEEE Trans. of Information Forensics and Security, Intl. J. Information Security, and J. of Information Security and Applications. He has received the Early Career Award 2013/14 from the Hong Kong Research Grants Council.

Russell W.F. Lai received the B.S. degree in Mathematics and B.Eng. degree in Information Engineering from The Chinese University of Hong Kong. He is working towards the M.Phil. degree in the Department of Information Engineering at The Chinese University of Hong Kong. His research interest is in Applied Cryptography, with focus on Searchable Encryption, Obfuscation, and Lattice-based Cryptography.

Qin Zou received his BE degree in information science in 2004 and PhD degree in photogrammetry and remote sensing (computer vision) in 2012 from Wuhan University. Qin Zou is an assistant professor with the School of Computer Science, Wuhan University. His research activities involve computer vision, machine learning, ubiquitous computing, intelligent transportation systems, etc.
