This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2016.2593444, IEEE Transactions on Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing (Volume: 15, Issue: 3, May-June 2018)
Abstract—Storage services allow data owners to store large amounts of potentially sensitive data, such as audio, images, and
videos, on remote cloud servers in encrypted form. To enable retrieval of encrypted files of interest, many searchable symmetric
encryption (SSE) schemes have been proposed. However, most existing SSE solutions construct indexes based on keyword-file pairs
and focus on boolean expressions of exact keyword matches. Moreover, most dynamic SSE solutions cannot achieve forward privacy
and reveal unnecessary information when updating the encrypted databases.
We tackle the challenge of supporting large-scale similarity search over encrypted feature-rich multimedia data, by considering the
search criteria as a high-dimensional feature vector instead of a keyword. Our solutions are built on carefully-designed fuzzy Bloom
filters which utilize locality sensitive hashing (LSH) to encode an index associating the file identifiers and feature vectors. Our schemes
are proven to be secure against adaptively chosen query attack and forward private in the standard model. We have evaluated the
performance of our scheme on various real-world high-dimensional datasets, and achieved a search quality of 99% recall with only a
small number of hash tables for LSH. This shows that our index is compact and that searching is not only efficient but also accurate.
Index Terms—Cloud storage, searchable encryption, homomorphic encryption, similarity search, proximity search
1 INTRODUCTION
Storage services have motivated data users, including both individuals and enterprises, to outsource their (potentially huge amount of) data to remote servers (e.g., cloud) to save the expensive local storage and management costs. However, outsourced data is no longer under the direct physical control of the data owner. Sensitive data, such as personal files, commercial secrets, and healthcare records, should be encrypted locally before outsourcing. Data encryption, however, renders data utilization a challenging task. In particular, it becomes difficult to retrieve data of interest based on their content as in the plaintext search.
To enable users to efficiently retrieve encrypted data of interest from a remote storage server, the notion of searchable symmetric encryption (SSE) was proposed [1] and the design of index-based SSE schemes has received much attention recently. The secret key in SSE can generate search tokens to perform search queries over encrypted data. Different formal security notions have been proposed (e.g., see [2]). By sacrificing access pattern and/or search pattern, most SSE schemes achieve highly efficient searches while maintaining privacy guarantees from a practical perspective.
Most existing SSE solutions [2], [3], [4], [5] constructed keyword-based search indexes (e.g., linked lists based on keyword-file pairs), supporting only efficient exact but not similarity searches. Moreover, most dynamic schemes cannot guarantee forward privacy, which requires that the database server cannot learn whether a newly added data file contains a keyword w that has been searched in the past.

• Q. Wang, M. Du, and Q. Zou are with The State Key Lab of Software Engineering, School of Computer Science, Wuhan University, China. E-mail: {qianwang,duminxin,qzou}@whu.edu.cn
• M. He is with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: mqhe@hku.hk
• S. Chow and R. Lai are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: {sherman,wflai}@ie.cuhk.edu.hk

1.1 Related Work
The first full-text SSE scheme was proposed by Song et al. [1] with search time linear in the length of the file collection. Curtmola et al. [3] generalized security definitions of SSE and proposed a construction based on the inverted index. The search time is linear in the number of files that contain keyword w, which is considered optimal. Kamara et al. [2] extended it to dynamic data, supporting both addition and removal of files. Kamara et al. [6] proposed a parallelizable and dynamic SSE scheme using red-black trees.
The above dynamic SSE schemes cannot achieve forward privacy, which means the search token allows linking the files to be added in the future. The scheme of Stefanov et al. [7] is forward private. Although the theoretical cost is sub-linear, the actual search time cost will be large when the number of document-keyword pairs is large.
Solutions [4], [8], [9], [10] supporting boolean expressions on keywords have also been proposed. Performing conjunctive queries over encrypted data was first considered by Golle et al. [8]. Their approach works by testing each encoded file against a set of tokens, so the complexity grows linearly with the number of files. This is also a general problem of public-key schemes [9]. For supporting generic boolean search over the encrypted data, Moataz and Shikfa [10] proposed a new SSE scheme which is also based on vectors (but not for similarity search). By transferring keywords to vectors and using the Gram-Schmidt process to orthogonalize and orthonormalize them, their scheme leverages the inner product of vectors to execute the encrypted search. The work of Cash et al. [4] first retrieves the results for one of the query keywords, and then filters the results according to the given boolean query against each of the remaining keywords. The above solutions only support keyword-based exact searches and do not allow similarity search over encrypted feature-rich data.
As a generalization of SSE, structured encryption is
1545-5971 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
proposed by Chase and Kamara [11]. The data structures supported are matrices and graphs. This is further generalized by Lai and Chow [12], whose scheme supports any abstract data type in the form of query and response pair. However, utilizing their scheme in this setting means one needs to pre-compute all possible queries, which does not satisfy our goal to support unforeseen queries.

1.2 Similarity Search
Similarity search has been widely used in the information retrieval community for content-based search of feature-rich data (e.g., image, audio, and video files), and other queries such as finding close/similar points in certain metric spaces [13]. The feature-rich data objects are usually represented as high-dimensional vectors in a metric space. Distances between objects are measured by Euclidean distance or other distance metrics. In contrast to exact search, similarity search can relax constraints on users' requests to retrieve a list of desirable results based on likely relevance. Similarity search also tolerates uncertain and inaccurate search inputs, and thus improves usability and search experience in general.
With the increasing popularity of sharing multimedia data in social networks, the demand for similarity search over encrypted multimedia data is also increasing. For example, users of social networks may want to identify and view multimedia files shared by their friends who enforced access control. Yet, similarity search has received attention mostly in the plaintext domain [14], [15].
Below we discuss three recent approaches for the establishment of similarity indexes while providing privacy-preserving content-based searches. Earlier approaches can be found in the references within.
Kuzu et al. [16] investigated the use of a locality sensitive hashing (LSH) family for supporting similarity search based on Jaccard distance, and used computationally-expensive methods to encrypt the index. The main problem of their scheme stems from the embedding technique for transforming the metric space on edit distance to the Euclidean space. It is proven that existing embedding approaches cannot provide sufficient distance preservation after space transformation, and lead to intolerable search errors [17]. Another limitation is the large number of hash tables required to cover the nearest objects. When this space requirement exceeds the main memory size, looking up the hash tables may cause substantial delay due to disk I/O access [18]. Moreover, it lacks a rigorous performance analysis, i.e., there is no theoretical characterization of the search quality.
Yuan and Yu [19] proposed a privacy-preserving biometric identification for large-scale encrypted databases. The search complexity, however, is approximately $O(mn^2)$ for the exact search, where m denotes the dataset size and n is the vector dimension. The security of their scheme relies on the heuristic argument that security comes from the "encryption" of feature vectors by randomly-generated matrices. It is possible that the introduced randomness can be removed in the presence of collusion. In addition, Moataz et al. [20] made the first attempt at performing similarity search over text keywords by combining the searchable encryption technique with a stemming algorithm to achieve fuzzy searches over encrypted data. Applying their technique to our setting means that one needs to tag the data with many possible keywords, which may not be practical.
Based on similar approaches, Wang et al. [21] proposed a multi-keyword fuzzy search scheme over encrypted data by using LSH and secure k nearest neighbors (kNN) [22]. Still, this scheme lacks formal and thorough analysis under a well-defined and rigorous security model as previously pointed out [22]. In contrast, most existing SSE schemes from the cryptography community have their security rigorously analyzed [1], [2].

1.3 Our Contributions
This paper tackles the problem of constructing a secure similarity index for efficient approximate members search over large-scale encrypted feature-rich data. The key idea is to transform the data into a set of feature vectors, which are further mapped by LSHs to an array position. We call the entry there a bucket, which will be pointing to the possible matching files. Since the LSH sends all similar files to the same output, together with our special treatment of propagation (to be explained), it hence creates a cluster of buckets which contains files approximately close to the query object with a high probability.
To represent the pointers to the file, we store the inverted file identifier vectors (IFVs) in each of these buckets. An IFV is a vector which represents the files that fall in a given bucket for one hash function (among the set of hash functions in the LSH). For a trade-off of computation efficiency and bandwidth, we have two options in encrypting the resulting IFVs, which are additive homomorphic encryption (for saving bandwidth) or pseudo-random functions/permutations (for more efficient computation).
The resulting schemes enable the client to perform privacy-preserving similarity search by checking only a small number of hash tables (or buckets). The response set consists of approximate files which can be readily used for privacy-preserving kNN or approximate nearest neighbors (ANN) search. Another important and desirable property of our index constructions is that they are highly compact, neat and efficient, and they can also efficiently support secure dynamic updates of file and index with minimal operations.
Our main contributions are summarized as follows.
1) We study the problem of similarity search over encrypted feature-rich data outsourced to remote servers. We characterize the unique security requirements and formally define the security notions for encrypted similarity search. We then present the design of our encrypted indexing schemes with two instantiations supporting privacy-preserving high-dimensional similarity search and data dynamics.
2) We present a thorough theoretical analysis and characterize the false positive and false negative rates of our constructions. We prove the semantic security of our schemes under the adaptively chosen query attack (CQA2) model with minimal leakage using simulation-based definitions and proofs. We also show that they achieve forward privacy.
3) We evaluate our constructions against three real-world datasets with different sizes and qualities. The results show that they are both time- and space-efficient, and that they provide high search quality in terms of recall and error ratio.
2 PROBLEM STATEMENT AND PRELIMINARIES
2.1 Similarity Search over Encrypted Feature-rich Data
We consider a data owner, also known as the client, with some large-scale and sensitive multimedia data to be outsourced to a cloud-based storage system. The client encrypts the data locally before outsourcing. The collection of multimedia data is denoted by $\mathbf{f} = (f_1, \ldots, f_{\#\mathbf{f}})$, and its corresponding encrypted form is denoted by $\mathbf{c} = (c_1, \ldots, c_{\#\mathbf{c}})$.
To enable similarity search over $\mathbf{c}$ for effective information retrieval, the data owner builds an encrypted index I based on $\mathbf{f}$ and encrypts $\mathbf{f}$ to obtain $\mathbf{c}$, then outsources the encrypted data $\gamma = (I, \mathbf{c})$ to the remote cloud server. Later, the client issues a query in the form of a search object q. This object q will be represented by a feature vector throughout the rest of the paper, which is a multi-dimensional vector of numerical features that represents some object (e.g., some selected key scene of a video file, color histograms for image data [23], etc.). The query may merely be near some of the files without exactly matching any of them. To this end, the client generates a search token $\tau_q$ for q and sends it to the cloud server. After receiving $\tau_q$, the cloud uses it to query I and returns all the encrypted file IDs whose corresponding files $\mathbf{c}_q$ have characteristics similar to the search object q.
The system is dynamic, which means that the client may add or remove multiple files $u = \mathbf{f}'$ to or from the servers at any time. To do so, the client generates an update token $\tau_u$. The server can then use $\tau_u$ to securely update $\mathbf{c}$ and I.
Formally, the core functionalities are defined below.

Definition 1. An encrypted (multimedia) data storage system for supporting similarity search and dynamic update consists of the following seven polynomial-time algorithms/protocols:
$sk \leftarrow \mathrm{Gen}(1^\lambda)$: is a probabilistic key generation algorithm run by the client. It takes as input a security parameter λ and outputs the secret key sk.
$\gamma \leftarrow \mathrm{Enc}(sk, \mathbf{f})$: is a probabilistic algorithm run by the client. It takes as input a secret key sk and a (multimedia) data collection $\mathbf{f}$, and outputs an encrypted database γ.
An encrypted database is defined by the tuple $(\mathbf{c}, I)$, where $\mathbf{c}$ is a sequence of ciphertexts, each of them encrypting the corresponding plaintext in $\mathbf{f}$, and I is an encrypted index, which supports querying the ciphertexts in $\mathbf{c}$ via the Search protocol. Conceptually, this algorithm consists of two disjoint operations: the index generation and the encryption of the plaintext.
$f \leftarrow \mathrm{Dec}(sk, c)$: is a deterministic algorithm run by the client. It takes as input a secret key sk and a ciphertext c, and outputs a file f.
$(\mathbf{c}_q; \bot) \leftarrow \mathrm{Search}(sk, q; \gamma)$: is a (possibly interactive and probabilistic) protocol run between the client and the server¹. The client side takes as input a secret key sk and a search object q, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains a sequence of ciphertexts $\mathbf{c}_q$ while the server has no local output.
$(\bot; \gamma') \leftarrow \mathrm{Update}(sk, u; \gamma)$: is a (possibly interactive and probabilistic) protocol run between the client and the server. The client side takes as input a secret key sk and an update object u, while the server side takes as input the encrypted database γ. Upon completion of the protocol, the client obtains nothing, while the server obtains an updated encrypted database $\gamma'$.

1. A protocol P run between the client and the server is denoted by $(u; v) \leftarrow P(x; y)$, where x and y are the respective inputs, and u and v are the respective outputs, of the client and the server.

2.2 Security Definitions
Similarity search over encrypted feature-rich data should provide the following security guarantees. First, an adversary cannot generate search tokens for feature vectors without the secret key. Second, the search tokens generated for the similarity indexes do not reveal any information of the original search objects (i.e., feature vectors of files of interest) beyond what is implied by the search result. Third, before and after similarity searches, the server learns only what is allowed to be leaked by the data owner, i.e., the search pattern and access pattern.
We follow the widely-accepted framework to analyze the security of SSE [2], [3]. The security of an SSE scheme can be characterized via the notion of history, search pattern, and access pattern [3]. Below we will give their definitions with respect to the SSE schemes we are going to propose.

Definition 2. (History Q). An s-query history over $\mathbf{f}$ is a vector of s queries $Q = \{q_1, \ldots, q_s\}$.

Definition 3. (Search Pattern π). Given a search object q at time t, the search pattern $\pi(\mathbf{f}, q)$ is defined by a binary vector of length t with a '1' at location i if the search at time $i \le t$ was for q; '0' otherwise.
The search pattern reveals whether the same search was performed in the past or not.

Definition 4. (Access Pattern $A_p$). Given a search object q at time t, the access pattern is defined by $A_p(\mathbf{f}, q) = \{\mathrm{id}(\mathbf{c}_q)\}$.

Now we come to the main definition asserting the security of an SSE scheme. Our definition follows the simulation-based framework [3], parameterized by the leakage required for the ideal world simulation [11].

Definition 5. (CQA2-security for SSE). Let $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ be leakage functions for setup, search, and update respectively, which serve as the parameters for the definition [11]. A searchable encryption scheme SSE is said to be secure against adaptively chosen query attack (CQA2), if for any PPT stateful adversary $\mathcal{A}$, there exists a PPT stateful simulator $\mathcal{S}$ such that for the following probabilistic experiments $\mathrm{Real}^{\mathrm{SSE}}_{\mathcal{A}}(\lambda)$ and $\mathrm{Ideal}^{\mathrm{SSE}}_{\mathcal{A},\mathcal{S}}(\lambda)$, we have
$$\left| \Pr[\mathrm{Real}^{\mathrm{SSE}}_{\mathcal{A}}(\lambda) = 1] - \Pr[\mathrm{Ideal}^{\mathrm{SSE}}_{\mathcal{A},\mathcal{S}}(\lambda) = 1] \right| \le \mathrm{negl}(\lambda).$$
$\mathrm{Real}^{\mathrm{SSE}}_{\mathcal{A}}(\lambda)$: The challenger runs $\mathrm{Gen}(1^\lambda)$ to generate the secret key sk. $\mathcal{A}$ outputs $\mathbf{f}$ and receives $\gamma \leftarrow \mathrm{Enc}(sk, \mathbf{f})$ from the challenger. $\mathcal{A}$ makes a polynomial number of adaptive queries q or updates u. For each query q, the challenger acts as a client and runs Search with $\mathcal{A}$ acting as a server. For each update u, the challenger acts as a client and runs Update with $\mathcal{A}$ acting as a server. Finally, $\mathcal{A}$ returns a bit b, which is the output of the experiment.
$\mathrm{Ideal}^{\mathrm{SSE}}_{\mathcal{A},\mathcal{S}}(\lambda)$: $\mathcal{A}$ outputs $\mathbf{f}$ and sends it to $\mathcal{S}$. Given $\mathcal{L}_1(\mathbf{f})$, $\mathcal{S}$ generates and sends γ to $\mathcal{A}$. $\mathcal{A}$ makes a polynomial number
of adaptive queries q or updates u. For each query q, $\mathcal{S}$ is given $\mathcal{L}_2(\mathbf{f}, q)$, and simulates a client who runs Search with $\mathcal{A}$ acting as an honest server. For each update u, $\mathcal{S}$ is given $\mathcal{L}_3(\mathbf{f}, u)$, and simulates a client who runs Update with $\mathcal{A}$ acting as an honest server. Finally, $\mathcal{A}$ returns a bit b, which is the output of the experiment.

The bit b output by both the real and ideal experiments, which originates from the adversary $\mathcal{A}$, can be seen as the decision of $\mathcal{A}$ in distinguishing whether it is in the real world or in the ideal world. The intuition is that, since the leakages alone are enough to simulate the ideal world, any scheme satisfying this definition will only let any computationally-bounded adversary learn the leakage but not the other information in the real world.
The above definition ensures adaptive security when the query objects are chosen as a function of the encrypted index and the results of previous queries. The leakage functions $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ will be defined for specific schemes.
We proceed to define forward privacy. Although some existing schemes (e.g., [7]) are forward private, to the best of our knowledge, a formal definition of the notion has never been explicitly given. We therefore give our own definition below.
Informally, forward privacy means that, for any search object q, the knowledge of the server learned from previous execution of the search protocol on q does not help the server decide whether a newly added ciphertext c relates to q or not. Our definition requires that any two "add" updates of the same size are indistinguishable.

Definition 6. (Forward Privacy). A searchable encryption scheme SSE is said to be forward private, if for any PPT stateful adversary $\mathcal{A}$, we have
$$\Pr[\mathrm{FwdPrv}^{\mathrm{SSE}}_{\mathcal{A}}(\lambda) = 1] \le \frac{1}{2} + \mathrm{negl}(\lambda).$$

[...] we need to enlarge the gap between the two probabilities, i.e., making $P_1 \gg P_2$. In practice, R can be pre-set by users in the construction of similarity indexes to meet the requirements of different applications. In Section 5, we carefully choose R based on the characteristics of each dataset.

In this paper, we choose the $l_2$ norm, i.e., the Euclidean norm $\|x\|_2 = \left( \sum_{i=1}^{d} x_i^2 \right)^{1/2}$, for LSH using 2-stable distribution in our construction. Specifically, each hash function is defined by
$$h_{a,b} = \left\lfloor \frac{a \cdot q + b}{W} \right\rfloor,$$
where a is a d-dimensional vector with entries chosen independently from a 2-stable distribution, namely the Normal distribution $g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, and b is chosen uniformly from $[0, W]$. Note that W should not be too large, otherwise $P_1$ and $P_2$ will both go to 1. In practice, W should be chosen (not too small) to enlarge the gap between $P_1$ and $P_2$ given a fixed c. More details can be found in the literature [24].
We remark that the LSH family aims at mapping similar objects into the same or the neighboring buckets. It is not meant to be a cryptographic one which aims at providing collision resistance, for instance.

Pseudo-random function (PRF) and permutation (PRP). Let $G : \{0,1\}^\lambda \times \{0,1\}^{\mathrm{poly}(\lambda)} \to \{0,1\}^{\mathrm{poly}(\lambda)}$ be a PRF, which is a polynomial-time computable function that cannot be distinguished from a random function by any polynomial-time adversary [25], and F be a PRP, which is a PRF that is also bijective.
Looking ahead, in our construction, for a given key k, the range of $F_k(\cdot)$ will be as large as the encrypted index; and $G_k(\cdot)$ takes as input a bit-string which can be as large as double the output length of $F_k(\cdot)$, and outputs a bit string of length $\#\mathbf{f}$.
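The 2-stable LSH hash function $h_{a,b}$ defined in this section can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names `sample_hash` and `lsh_hash` are hypothetical, and Python's `random` module stands in for whatever randomness source a real deployment would use.

```python
import math
import random


def sample_hash(d, W, rng):
    """Sample one hash function: a ~ N(0, 1)^d (2-stable), b ~ U[0, W]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, W)
    return a, b


def lsh_hash(q, a, b, W):
    """Evaluate h_{a,b}(q) = floor((a . q + b) / W)."""
    dot = sum(ai * qi for ai, qi in zip(a, q))
    return math.floor((dot + b) / W)


if __name__ == "__main__":
    rng = random.Random(2018)          # fixed seed only to make the demo repeatable
    a, b = sample_hash(d=3, W=4.0, rng=rng)
    p = [0.5, 1.0, -2.0]
    near = [0.5, 1.1, -2.0]            # a small perturbation of p
    print(lsh_hash(p, a, b, 4.0), lsh_hash(near, a, b, 4.0))
```

Nearby vectors project to nearby values of $a \cdot q$, so they tend to land in the same or an adjacent bucket; a larger W widens each bucket and raises both collision probabilities, which matches the remark on choosing W above.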
[...]tion) "$\bigwedge$", which take the same notations as the single-bit OR and AND operations, respectively.
Feature-rich data objects are typically represented as high-dimensional vectors of numerical features. We use $p_i$ to denote the d-dimensional feature vector extracted from data file i. When representing images, the feature values might correspond to the pixels of an image; when representing texts, perhaps term occurrence frequencies.
$[n]$ denotes the set of positive integers less than or equal to n, i.e., $[n] = \{1, 2, \ldots, n\}$.

2.5 Inverted File Vectors
Given a vector v, we refer to its i-th element as $v[i]$ or $v_i$. We use $e_i$ to denote a unit basis vector which is a bit-vector of length $\#\mathbf{f}$ having the i-th entry being 1; 0 otherwise.
We use $\mathrm{fid}_j \in \{0,1\}^{\#\mathbf{f}}$ to denote the inverted file identifier vector (IFV) placed at the j-th bucket. In $\mathrm{fid}_j$, the i-th entry of the IFV, i.e., $\mathrm{fid}_j[i]$, is 1 if $e_i$ has been mapped into the j-th bucket; 0 elsewhere. We use $\widehat{\mathrm{fid}}$ to denote a joint IFV computed from the "union" of multiple IFVs, to be explained in the next section. The IFV is a cornerstone of our constructions.

3 OUR CONSTRUCTIONS
This section presents our index data structure for similarity search over feature-rich encrypted multimedia data.

[...] probability of nearby objects and increasing the chance of finding similar data objects close to the query. To do so, we generate a joint IFV $\widehat{\mathrm{fid}}$ by calculating the "union" of each bucket and its two neighboring buckets, thus probing $\widehat{\mathrm{fid}}$ is equivalent to probing three adjacent buckets in one shot. For security, the buckets will be permuted through a pseudo-random permutation as will be shown in our construction.
We propose two schemes based on the above ideas by encrypting $\widehat{\mathrm{fid}}$ using either the Paillier homomorphic cryptosystem (with ciphertext packing), or by XOR-ing bit vectors with the outputs of a PRF. Intuitively, Scheme-I has lower communication cost, where the search result consists of only one Paillier encryption aggregating M encrypted IFVs. Our Scheme-II is more computationally efficient due to the use of symmetric encryption, but it requires the server to return M encrypted IFVs. More details can be found in the experimental results in Table 5.

3.2 Basic Construction without Index Privacy
We first present our basic index construction, which does not protect the privacy of IFVs, as follows.

3.2.1 Initialization and Index Construction
$\mathrm{Gen}(1^\lambda)$: Given a security parameter λ, choose the following uniformly at random from the corresponding domains:
• a PRP key $k_1 \xleftarrow{\$} \{0,1\}^\lambda$ for $F_{k_1}(\cdot)$,
• a symmetric key $k_2 \leftarrow \mathrm{SKE.Gen}(1^\lambda)$ for a CPA-secure symmetric encryption SKE,
• functions $(h_1, h_2, \ldots, h_M)$ from the LSH family $\mathcal{H}$.
The M hash functions will be used to generate M hash tables, each consisting of m buckets. The output of this algorithm is $sk = (k_1, k_2, m, h_1, h_2, \ldots, h_M)$.

$\mathrm{Enc}(sk, \mathbf{f})$:
1) [Initialize a temporary array of buckets]
Let $x_1, \ldots, x_{Mm}$ be the positions of all buckets. For each $x_j$, store $\mathbf{0}$ (a bit-vector of length $\#\mathbf{f}$ having all 0 bits) in $I[x_j]$.
2) [Prepare the feature-to-file index according to the buckets given by LSH]
For each file $f_i \in \mathbf{f}$,
(a) Obtain its corresponding d-dimensional feature vector $p_i$ using the feature extraction algorithm for the specific data type (e.g., color histograms for image data [23]).
(b) Compute the M hash bucket positions of the LSH $(x_1 = g_1(p_i), \ldots, x_M = g_M(p_i))$, where [...]
[...] the neighboring left and right buckets, respectively. Store $\widehat{\mathrm{fid}}_j$ in $I[x_j]$.
4) [Permute the buckets]
Permute array I according to the mapping of $x_j$ to the randomized position $F_{k_1}(x_j)$ for each position $x_j$.
5) For $1 \le i \le \#\mathbf{f}$, let $c_i \leftarrow \mathrm{SKE.Enc}_{k_2}(f_i)$, where SKE can be any secure encryption scheme.
6) Output $\gamma = (I, \mathbf{c})$, where $\mathbf{c} = (c_1, \ldots, c_{\#\mathbf{f}})$.

$\mathrm{Dec}(sk, c)$: Return $f \leftarrow \mathrm{SKE.Dec}_{k_2}(c)$.

3.2.2 Search Over the Similarity Index
The client searches by first generating a search token $\tau_q$ for the search object q. The server can then use the token to locate the buckets in the similarity index I and extract the corresponding joint IFVs. Finally, the server decodes the IDs of files that are approximate members to the query q, and returns the corresponding ciphertexts to the client.

$\mathrm{Search}(sk, q; I)$:
Client Side:
Extract the d-dimensional feature vector q based on the interest of the client. Compute $(x_1 = g_1(q), \ldots, x_M =$
$g_M(q))$ and the search token $\tau_q = (F_{k_1}(x_1), \ldots, F_{k_1}(x_M))$. Send the search token $\tau_q$ to the server.
Server Side:
Let $y_1 = F_{k_1}(x_1), \ldots, y_M = F_{k_1}(x_M)$. Extract the corresponding joint IFVs $\widehat{\mathrm{fid}}_i$ from $I[y_i]$ for each $i \in [M]$. Compute $d = \bigwedge_{i=1}^{M} \widehat{\mathrm{fid}}_i$. Return $\mathbf{c}_q = \{c_i \in \mathbf{c} : d[i] = 1\}$.

3.2.3 Supporting Efficient Index Dynamics
$\mathrm{Update}(sk, \mathbf{f}'; \gamma)$:
Client Side:
To add new files $\mathbf{f}'$, the client locally encrypts the files as $\mathbf{c}_{\mathbf{f}'}$ and builds the sub-index $I_{\mathbf{f}'}$ for $\mathbf{f}_{\mathbf{f}'}$ (which has the same structure as the similarity index on the server). To do so, the client simply runs $(I_{\mathbf{f}'}, \mathbf{c}_{\mathbf{f}'}) \leftarrow \mathrm{Enc}(sk, \mathbf{f}_{\mathbf{f}'})$ and sets the update token $\tau_u = (I_{\mathbf{f}'}, \mathbf{c}_{\mathbf{f}'})$.

[Figure: an original IFV and the extended IFV to be encrypted; labels: "log2 3M-bit block", "1024 bit", "Pad 0's".]
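The index-building steps of Enc and the two sides of Search in the basic construction can be sketched on plaintext IFVs as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the position map $g_j(p) = j \cdot m + (h_j(p) \bmod m)$ is an assumption (its definition is not recoverable from the text above), neighbors are taken within one hash table without wrap-around, a keyed shuffle stands in for the PRP $F_{k_1}$ and is not cryptographically secure, and file encryption with SKE is omitted. All function names are hypothetical.

```python
import math
import random


def make_lsh(d, M, W, seed):
    """Sample M hash functions h_{a,b}: a ~ N(0, 1)^d, b ~ U[0, W]."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, W))
            for _ in range(M)]


def positions(p, hashes, m, W):
    """Bucket positions x_1..x_M, using the ASSUMED map g_j(p) = j*m + (h_j(p) mod m)."""
    out = []
    for j, (a, b) in enumerate(hashes):
        h = math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / W)
        out.append(j * m + h % m)
    return out


def build_index(vectors, hashes, m, W, k1):
    """Steps 1-4 of Enc on plaintext IFVs (Python ints used as bit vectors)."""
    M = len(hashes)
    size = M * m
    ifv = [0] * size                      # step 1: all-zero IFVs
    for i, p in enumerate(vectors):       # step 2: set bit i in every hit bucket
        for x in positions(p, hashes, m, W):
            ifv[x] |= 1 << i
    joint = [0] * size                    # step 3: union with left/right neighbours
    for j in range(M):
        for s in range(m):
            x = j * m + s
            joint[x] = ifv[x]
            if s > 0:
                joint[x] |= ifv[x - 1]
            if s < m - 1:
                joint[x] |= ifv[x + 1]
    perm = list(range(size))              # step 4: keyed shuffle as a stand-in for F_{k1}
    random.Random(k1).shuffle(perm)
    index = [0] * size
    for x in range(size):
        index[perm[x]] = joint[x]
    return index, perm


def search(q, hashes, m, W, perm, index):
    """Client builds the token; server ANDs the M joint IFVs (assumes M >= 1)."""
    token = [perm[x] for x in positions(q, hashes, m, W)]
    d = -1                                # all-ones accumulator
    for y in token:
        d &= index[y]
    return [i for i in range(d.bit_length()) if (d >> i) & 1]
```

A query equal to a stored vector always returns that file's identifier, because both hash to the same M buckets. Whether a merely nearby vector is returned depends on it landing in the same or an adjacent bucket in every one of the M tables, which is what the choice of W and M controls.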
index $I[F_{k_1}(x_i)]$ ($i \in [Mm]$), use the Paillier cryptosystem to encrypt $\widehat{fid}_i$ under the public key $pk_p$ according to the packing method presented above.

3.3.2 Search Over the Similarity Index
After locating the hit encrypted IFVs, the server homomorphically sums them and returns the resulting ciphertext to the client. For each query object $q$, the server therefore only needs to transfer one ciphertext, corresponding to the sum of the plaintext values (the hit IFVs). The client then decrypts the ciphertext and unpacks the blocks according to the alignment set by the above packing method, i.e., by interpreting the decryption result as a concatenation of many identifiers as bit-strings and then separating them. In this way, the client effectively delegates the public-key computation burden to the server (i.e., the server integrates all encrypted IFVs into one, while the client only needs to perform one decryption).

Search($sk$, $q$; $\gamma$):
Client Side:
Identical to that in the basic construction.
Server Side:
Let $y_1 = F_{k_1}(x_1), \ldots, y_M = F_{k_1}(x_M)$. Extract the corresponding encrypted $\widehat{fid}_i$ from $I[y_i]$ for each $i \in [M]$. Compute the intermediate encrypted result $\theta_q = \prod_{i=1}^{M} \llbracket \widehat{fid}_i \rrbracket$ and return it to the client.
Client Side:
1) Decrypt $\theta_q$ using the private key $sk_p$.
2) Unpack the decrypted message to obtain $\mathbf{b} = (b_1, \ldots, b_{\#\mathbf{f}})$, where $b_i$ is the binary representation of a $\lceil \log_2 3M \rceil$-bit integer.
3) Retrieve the ciphertexts corresponding to $\mathbf{i} = \{i \in [\#\mathbf{f}] : b_i = M\}$ from the server.

2) For each file $f_i \in \mathbf{f}'$ to be deleted, for the $\frac{i \cdot \lceil \log_2 3M \rceil}{1024}$-th block in all the buckets $I[x]$, and $I_{f_i}$ extracted from $\tau_u$, compute $I[x] \leftarrow I[x] \cdot I_{f_i}$ (by leveraging the additively homomorphic property) and delete $c_i$ for all $i \in \mathbf{i}$, where $\mathbf{i} = \{i : f_i \in \mathbf{f}'\}$. This works since the decimal value of entries corresponding to $f_i \in \mathbf{f}'$ in $I$ is changed from $M$ to $3M$ and others are turned to $2M$.

If the last "blocks" of the original index $I$ do not encode the information of $\frac{1024}{\lceil \log_2 3M \rceil}$ files, the client selects part of the files from $\mathbf{f}'$ to construct the sub-index $I'_{f'}$ to merge into the last "blocks" of the index $I$. This is to make sure that the last "blocks" of the index $I$ are encoded with $\frac{1024}{\lceil \log_2 3M \rceil}$ files.

3.4 Scheme-II: Using Pseudorandom Padding
To further improve the computation efficiency, we next introduce Scheme-II, which leverages a "one-time-pad" encryption instantiated by a PRF. It is CPA-secure due to the pseudorandomness of the PRF, which in turn relies on its one-wayness [25]. To avoid repeating the same steps, we only present the key differences and modifications.

3.4.1 Initialization and Index Construction
Gen($1^\lambda$): Besides the basic parameters such as $m$ and those for LSH, we need $k_1$, $k_2$, and an additional bit string $k_3 \xleftarrow{\$} \{0,1\}^\lambda$ chosen uniformly at random to serve as the key of the PRF $G_{k_3}(\cdot)$. Output $sk = (k_1, k_2, k_3, m, h_1, h_2, \ldots, h_M)$.

Enc($sk$, $\mathbf{f}$): Based on the basic similarity index construction, the following additional step is added.
3) For each IFV stored in the index $I[F_{k_1}(x_i)]$ ($i \in [Mm]$), encrypt $\widehat{fid}_i$ as $I[F_{k_1}(x_i)] := \widehat{fid}_i \oplus G_{k_3}(\mathrm{IndexID} \,\|\, F_{k_1}(x_i))$.
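The PRF-based "one-time-pad" encryption of an IFV in Scheme-II can be sketched as follows. This is only a minimal illustration under stated assumptions: HMAC-SHA256 in counter mode stands in for the PRF $G_{k_3}$ (the source does not fix a concrete PRF), and the byte encodings of `index_id` and `bucket_tag` (the latter playing the role of $F_{k_1}(x_i)$) are hypothetical.

```python
import hashlib
import hmac


def prf_pad(key: bytes, label: bytes, nbytes: int) -> bytes:
    """Expand HMAC-SHA256(key, label || counter) into an nbytes-long pad.

    Assumption: HMAC-SHA256 is used here as a stand-in for the PRF G.
    """
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        msg = label + counter.to_bytes(4, "big")
        out.extend(hmac.new(key, msg, hashlib.sha256).digest())
        counter += 1
    return bytes(out[:nbytes])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_ifv(k3: bytes, index_id: bytes, bucket_tag: bytes, ifv: bytes) -> bytes:
    """Mask an IFV with the pad G_k3(IndexID || bucket_tag)."""
    pad = prf_pad(k3, index_id + b"|" + bucket_tag, len(ifv))
    return xor_bytes(ifv, pad)


# XOR with the same pad undoes the masking, so decryption is the same operation.
decrypt_ifv = encrypt_ifv

# Hypothetical toy values: a 16-bit IFV covering 16 files.
k3 = bytes(range(32))
index_id = b"index-0"
bucket_tag = b"bucket-42"
ifv = bytes([0b10110000, 0b00000001])
ct = encrypt_ifv(k3, index_id, bucket_tag, ifv)
recovered = decrypt_ifv(k3, index_id, bucket_tag, ct)
```

Because the pad depends only on the key and the label, the client can regenerate it during search and strip it with one XOR per hit bucket, which is why Scheme-II avoids public-key operations entirely.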
2) Decrypt $\mathbf{r}$ using $G_{k_3}(\mathrm{IndexID} \,\|\, F_{k_1}(x_i))$ for $i \in [M]$ to obtain
$$d_i = r_i \oplus G_{k_3}(\mathrm{IndexID} \,\|\, F_{k_1}(x_i)) = I[y_i] \oplus G_{k_3}(\mathrm{IndexID} \,\|\, F_{k_1}(x_i)) = \widehat{fid}_i.$$
3) Compute $d = \bigwedge_{i=1}^{M} \widehat{fid}_i$.
4) Send $\mathbf{i} = \{i : d[i] = 1\}$ to the server, or use PIR.

Server Side:
Return $c_q = \{c_i : i \in \mathbf{i}\}$ to the client.

3.4.3 Supporting Efficient Index Dynamics
Update($sk$, $\mathbf{f}'$; $\gamma$):
Client Side:
Based on the basic similarity index construction, the following modifications should be made for addition.
1) For each row vector denoted by AddVec in $I'$, set AddVec$[j]$ = AddVec$[j] \oplus G_{k_3}(\mathrm{IndexID} \,\|\, j)[\#\mathbf{f} + i]$, where $j \in [Mm]$.
2) Output the encrypted sub-index $I_{f'}$ and $\mathbf{c}'$ as $\tau_u$.
For deletion using $\tau_u$, the following adaptations are made.
1) For each file $f_i$ to be deleted, generate an $(Mm)$-dimensional row vector DelVec, where each entry is initialized to 0. Then encrypt it as DelVec$[j]$ = DelVec$[j] \oplus G_{k_3}(\mathrm{IndexID} \,\|\, j)[i]$, where $j \in [Mm]$.
2) Output the encrypted sub-index $I_{f'}$ as $\tau_u$.
Server Side:
1) For file addition, append the sub-index $I_{f'}$ extracted from $\tau_u$ to $I$ and $\mathbf{c}'$ into $\mathbf{c}$, respectively.
2) For file deletion, to delete $fid \in \mathbf{f}'$, set the $i$th row of $I$ to the encrypted DelVec of $fid$ extracted from $I_{f'}$ and delete $\mathbf{c}'$ from $\mathbf{c}$ accordingly.
It is easy to see the correctness of the update.

3.5 Efficiency Analysis
For the search time efficiency, $M$ hash operations are used to generate the search token at the client side for both schemes. At the server side, for Scheme-I, $\frac{\#\mathbf{f} \cdot \lceil \log_2(3M) \rceil \cdot (M-1)}{1024}$ additive homomorphic operations are applied to aggregate $M$ encrypted IFVs, while Scheme-II only needs to locate the buckets using the search token. For the client to decrypt the (intermediate) search results, $\frac{\#\mathbf{f} \cdot \lceil \log_2(3M) \rceil}{1024}$ (Paillier) decryption operations, or $M$ XOR operations over bit vectors of length $\#\mathbf{f}$, are needed for Scheme-I and Scheme-II respectively. Besides, $\#\mathbf{f}$ checking operations are involved to evaluate the equality of each decrypted entry to 1 or $M$.
Although the searching time complexity is theoretically $O(\#\mathbf{f})$, we still achieve practical efficiency. For Scheme-II, the computations involved are XOR and equality checking, which are extremely efficient. For Scheme-I, due to the use of the packing technique, only a fraction of $\#\mathbf{f}$ ciphertexts is processed instead of $\#\mathbf{f}$ ciphertexts. Our experiments also confirmed that the involved computation, namely the public homomorphic aggregation of ciphertexts at the server side and the decryption of aggregated ciphertexts at the client side, attains practical efficiency.
While the de facto standard of SSE efficiency is sub-linear in $\mathbf{f}$, those schemes only work for simple keyword search or a boolean expression based on keyword comparison. Conceptually, those schemes process a selected set of keywords instead of the files themselves. Since our aim is to support fuzzy search across many large files, it is practically infeasible to adopt this approach, which requires preparing every possible "keyword" across a large number of files and tagging the files accordingly. One could alternatively consider verifying the similarities between each data file and the query object file. However, this either nullifies the advantages of cloud computing for the client (if the processing is done at the client side) or puts an extremely high burden on the cloud (which must go through each of the large files to check whether it matches the query, in the encrypted domain). In our schemes, $O(\#\mathbf{f})$ computation is needed due to the functionality we aim to support: answering unforeseen queries and returning results based on similarity. Under this setting, a sub-optimal solution may miss some of the candidates. The elegance of our design is that, while still requiring $O(\#\mathbf{f})$ operations, we restrict them to the basic primitive operations discussed previously.
We leverage the cloud to do most of the computation jobs. Compared with the (trivial) solution of returning $O(\#\mathbf{f})$ "objects" (anything after post-processing of the cloud on encrypted data, say, via fully homomorphic encryption; or in the trivial case, the ciphertexts themselves) for the client to process, we obtain $\frac{\#\mathbf{f}}{\phi(d)}$ times of savings for communication and storage costs, where $\phi(d)$ denotes the size of the returned candidate set. These significant savings are also demonstrated numerically in Tables 2, 3, 4.

3.6 Security Analysis
For both of our constructions, we define the leakage functions as follows.
Definition 8. (Leakage function $L_1$ for setup). Given a multimedia data collection $\mathbf{f}$, $L_1(\mathbf{f}) = \{|f_i|\}_{i=1}^{\#\mathbf{f}}$, where $|*|$ denotes the length of the file.
Definition 9. (Leakage function $L_2$ for search). Given a multimedia data collection $\mathbf{f}$ and a search object $q$, $L_2(\mathbf{f}, q) = \{\pi(\mathbf{f}, q), A_p(\mathbf{f}, q), E(q)\}$, where $E(q)$ is the identifiers of the buckets corresponding to $q$, i.e., $E(q) = \{id(g_1(q)), \ldots, id(g_M(q))\}$.
Specifically, in the above definition, the access pattern implicitly includes intermediate results, e.g., the encrypted IFVs extracted from the hit buckets. We emphasize that the encrypted IFVs and their encrypted aggregation do not reveal additional information since these intermediate results are encrypted before the recovery of $id(c_q)$.
Definition 10. (Leakage function $L_3$ for update). Given a multimedia data collection $\mathbf{f}$ and an update object $u$, $L_3(\mathbf{f}, u) = \{id(f_i), |f_i|\}_{i=\#\mathbf{f}+1}^{\#\mathbf{f}+\#\mathbf{f}'}$ for add updates, and $L_3(\mathbf{f}, u) = \{id(f_i)\}_{i \in \#\mathbf{f}'}$ for delete updates.
Theorem 1. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is CQA2-secure in the standard model.
Proof. The only difference between Scheme-I and Scheme-II lies in how the IFVs are encrypted. We only use the fact that both methods are CPA-secure in the following proof, without utilizing the homomorphic property. Thus, the proofs for both schemes are identical.
For either scheme, we argue its CQA2-security by describing a polynomial-time simulator $S$, such that for any PPT adversary $A$, the outputs of $\mathrm{Real}^{SSE}_{A}(\lambda)$ and $\mathrm{Ideal}^{SSE}_{A,S}(\lambda)$ are indistinguishable, as follows:
[Setup] $S$ is given $L_1(\mathbf{f}) = \{|f_i|\}_{i=1}^{\#\mathbf{f}}$. Let $I$ be a dictionary of size $Mm$ mapping $\lambda$-bit random strings to encryptions of zeros by the underlying encryption scheme. The number of zeros encrypted is determined by $\#\mathbf{f}$ (and the packing method described in the construction for Scheme-I). Generate $k_2' \leftarrow$ SKE.Gen($1^\lambda$) and $c_i \leftarrow$ SKE.Enc$_{k_2'}(\{0,1\}^{|f_i|})$ for $i \in [\#\mathbf{f}]$. Output $\gamma = (I, \mathbf{c})$, where $\mathbf{c} = \{c_1, \ldots, c_{\#\mathbf{f}}\}$. $S$ maintains internally a mapping from the $Mm$ identifiers of the buckets to the $\lambda$-bit random strings generated above.
[Simulating Queries] $S$ is given $L_2(\mathbf{f}, q) = \{\pi(\mathbf{f}, q), A_p(\mathbf{f}, q), E(q)\}$. Consider the $M$ identifiers of buckets corresponding to the query object $q$. $S$ sends the random strings corresponding to these identifiers to $A$. Upon verifying the response from $A$ against the index $I$, $S$ sends the file identifiers given in $A_p(\mathbf{f}, q)$ to $A$, who then returns the corresponding ciphertexts.
[Simulating Updates] For add updates, $S$ is given $L_3(\mathbf{f}, u) = \{id(f_i), |f_i|\}_{i=\#\mathbf{f}+1}^{\#\mathbf{f}+\#\mathbf{f}'}$, which are simulated as in the setup phase. For delete updates, $S$ is given $L_3(\mathbf{f}, u) = \{id(f_i)\}_{i \in \#\mathbf{f}'}$. $S$ simulates a fresh $I_{f'}$ as in the setup phase, and sends $(I_{f'}, \{id(f_i)\}_{i \in \#\mathbf{f}'})$ to $A$.
The indistinguishability between $\mathrm{Real}^{Scheme\text{-}I}_{A}(\lambda)$ and $\mathrm{Ideal}^{Scheme\text{-}I}_{A,S}(\lambda)$ follows from the CPA security of SKE or the XOR-based encryption scheme, and the indistinguishability of the PRF from random functions.
The simulation above is quite simple. Specifically, our schemes only return an aggregated encrypted IFV (for Scheme-I) or a number of encrypted IFVs (for Scheme-II), instead of directly retrieving the encrypted results for the client. That is why the simulator can just return a (set of) random string(s) of appropriate length, since the adversary cannot distinguish them without the corresponding secret key for decryption. Also, in retrieving the encrypted results for the client in the final stage, the simulator does not need to work directly with the encrypted index. Instead, the information available from the leakage function already allows it to do that. From another perspective, the adaptive security of our scheme can be achieved in the standard model due to the non-trivial communication overhead, i.e., the encrypted IFV is of length $O(\#\mathbf{f})$.
For forward privacy, observe that adding new files to the database introduces new columns in encrypted form, and that the decryption of IFVs is performed at the client side during queries. Thus, even if equipped with previous queries, the server is unable to decrypt the newly added portions of the IFVs.

Theorem 2. Assuming the decisional composite residuosity problem is hard (resp. the existence of one-way functions), Scheme-I (resp. Scheme-II) is forward private in the standard model.
Proof. By the same argument as in the proof of Theorem 1, we can assume that the encryption schemes in both Scheme-I and Scheme-II are CPA-secure. In the forward privacy game, upon receiving the challenge updates $u_0$ and $u_1$, the challenger simply encrypts zeros to produce dummy ciphertexts $\mathbf{c}'$ and a dummy index $I_{f'}$. If the adversary can distinguish the dummy update from the real ones, we can construct an adversary which breaks the CPA-security of the encryption scheme, which happens with probability less than negl($\lambda$). Suppose that the adversary cannot distinguish the above. The simulated ciphertexts and index contain no information about $u_0$ and $u_1$, so the success probability of the adversary is exactly $\frac{1}{2}$. Thus, the overall success probability of the adversary is bounded above by $\frac{1}{2} + \mathrm{negl}(\lambda)$.

4 THEORETICAL ANALYSIS OF SEARCH QUALITY
For the search quality of our system, the computation of the union of IFVs in consecutive neighboring buckets can greatly reduce the false negative rate, and the computation of the bitwise intersection of IFVs in multiple hashing-hit buckets during search can eliminate a large amount of false positives.
Based on Definition 7, each LSH $h_i(q) = \lfloor \frac{a \cdot q + b}{W} \rfloor$ maps the $d$-dimensional points $q$ onto intervals of length $W$. After mapping all feature vectors $p$ extracted from $\mathbf{f}$, the probability that a query $q$ falls into the same bucket as some point $p$ depends on the interval length $W$ and the distance between $q$ and $p$ (i.e., $\| p - q \|_2$).
Fig. 2 shows the probability that $q$ falls into the same bucket as $p$ and the probability that $q$ falls into the left or right neighboring bucket of $p$. Consider an LSH $h_i(\cdot)$. For a fixed $p$, let $f_p(q)$ be the probability that $q$ and $p$ fall into the same interval (i.e., collide with each other under $h_{a,b}(\cdot)$). In addition, we use $f_p^-(q)$ and $f_p^+(q)$ to denote the probability that $q$ falls into the neighboring left and right intervals, respectively. For ease of analysis, we let $x(\delta)$ ($\delta \in \{-1, 1\}$) be the absolute distance of $p$ from the boundary of the interval $h_{a,b}(p) + \delta$. According to the definition of $h_{a,b}(q) = \lfloor \frac{a \cdot q + b}{W} \rfloor$, the difference between the projections of $q$ and $p$ onto the line is $(a \cdot q + b) - (a \cdot p + b) = a \cdot (q - p)$, distributed as $\| p - q \|_2 X$, where $X$ follows a Gaussian distribution. When $W$ is large enough, $q$ can fall into the same bucket as $p$, or its left/right neighboring slots. Formally:
Lemma 3. The probability that two items $p$ and $q$ collide for an LSH is
$$f_p(q) = p(u) = \Pr[h_{a,b}(q) = h_{a,b}(p)] = \int_0^W \frac{1}{u}\, g\!\left(\frac{t}{u}\right)\left(1 - \frac{t}{W}\right) dt,$$
where $p(u)$ is the probability as a function of $u$, $u = \| p - q \|_2$, and $g(t)$ is the probability density function of the absolute value of the 2-stable (standard Gaussian) distribution, namely, $g(t) = \frac{2}{\sqrt{2\pi}} e^{-t^2/2}$ for $t \ge 0$.
Proof. Set $t = |(a \cdot p + b) - (a \cdot q + b)| = |a \cdot (p - q)|$ and $x = \frac{t}{u}$. Therefore, $x$ is distributed as $|X|$. As shown in Fig. 3, the probability that $\frac{a \cdot q + b}{u}$ falls into the same slot as $\frac{a \cdot p + b}{u}$ is $\int_0^{W/u} g(x)\, \frac{W/u - x}{W/u}\, dx$. According to the substitution $x = \frac{t}{u}$ (so that $dx = \frac{dt}{u}$ and $\frac{W/u - x}{W/u} = 1 - \frac{t}{W}$), this equals $\int_0^W \frac{1}{u}\, g\!\left(\frac{t}{u}\right)\left(1 - \frac{t}{W}\right) dt$.
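The collision probability in Lemma 3 can be evaluated numerically. The sketch below integrates the expression with a midpoint rule and compares it with the closed form obtained by carrying out the integral for the Gaussian case. It assumes that $g$ is the density of $|X|$ for a standard Gaussian $X$, i.e., $g(t) = \frac{2}{\sqrt{2\pi}} e^{-t^2/2}$ for $t \ge 0$; the function names and sample values are illustrative only.

```python
import math


def g(t: float) -> float:
    """Density of |X| for a standard Gaussian X (half-normal), t >= 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-t * t / 2.0)


def collision_probability(u: float, W: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral in Lemma 3."""
    h = W / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += (1.0 / u) * g(t / u) * (1.0 - t / W)
    return total * h


def collision_probability_closed(u: float, W: float) -> float:
    """Closed form of the same integral, with s = W / u."""
    s = W / u
    return math.erf(s / math.sqrt(2.0)) - (
        2.0 / (s * math.sqrt(2.0 * math.pi))
    ) * (1.0 - math.exp(-s * s / 2.0))


# Illustrative values: bucket width W = 300 and three query-to-point distances.
probs = [collision_probability(u, 300.0) for u in (50.0, 150.0, 300.0)]
```

For a fixed $W$, the probability decreases as the distance $u$ grows, which is the locality-sensitive behavior the analysis relies on; enlarging $W$ raises the collision probability for near and far points alike.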
Recall   Error    Search Time of   Search Time of   M   m    W     Space Cost of    Space Cost of     #f/φ(d)
         Ratio    Scheme-I (s)     Scheme-II (s)                   Scheme-I (MB)    Scheme-II (MB)
0.9878   1.0006   49.94            0.1561           5   37   300   102.53           12.81             112
0.9527   1.0036   49.94            0.1561           5   49   240   135.78           16.97             236
0.9132   1.0086   49.94            0.1561           5   53   210   146.88           18.35             373
[Figure: recall (y-axis) versus W (x-axis) for M = 5, M = 10, and M = 15. Panels: (a) Forest Trace dataset, (b) Image dataset I, (c) Image dataset II.]
Fig. 4: Search quality for different W (R = 150 for Forest Trace dataset, R = 500 for image dataset I, k = 10 for image dataset II)
                         M = 5                     M = 10                    M = 15
Dataset                  Scheme-I     Scheme-II    Scheme-I     Scheme-II    Scheme-I     Scheme-II
                         (KB)         (KB)         (KB)         (KB)         (KB)         (KB)
Forest Covertype Trace   567.39       709.24       712.02       1418.48      854.43       2127.72
Image Data I             58.11        72.63        72.91        145.26       87.5         217.89
Image Data II            14.60        18.25        18.32        36.50        21.99        54.75
TABLE 5: The communication costs of Scheme-I and Scheme-II are #f · ⌈log2 3M⌉ bits and #f · M bits, respectively.
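The two cost formulas in the caption of Table 5 can be worked through with a few lines of Python. This is only a sketch: the file count below is a hypothetical value rather than the size of any dataset in the paper, the 1024-bit plaintext size follows the packing described for Scheme-I, and the helper names are made up for illustration.

```python
def block_bits(M: int) -> int:
    """ceil(log2(3M)) computed with integer arithmetic."""
    return (3 * M - 1).bit_length()


def scheme1_bits(num_files: int, M: int) -> int:
    """#f * ceil(log2(3M)) bits of packed plaintext (Scheme-I)."""
    return num_files * block_bits(M)


def scheme2_bits(num_files: int, M: int) -> int:
    """#f * M bits: M bit vectors of length #f (Scheme-II)."""
    return num_files * M


def packed_plaintexts(num_files: int, M: int, plaintext_bits: int = 1024) -> int:
    """Number of 1024-bit plaintexts needed to pack all blocks (rounded up)."""
    return -(-scheme1_bits(num_files, M) // plaintext_bits)


num_files = 100_000  # hypothetical #f
ratios = {M: scheme1_bits(num_files, M) / scheme2_bits(num_files, M) for M in (5, 10, 15)}
```

The ratio depends only on M (it is ⌈log2 3M⌉ / M), and the values for M = 5, 10, 15 are roughly in line with the Scheme-I to Scheme-II ratios reported in Table 5.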
(a) Forest Trace dataset (b) Image dataset I (c) Image dataset II
Fig. 7: Recall of the files with different distance to the query (M = 5, R = 150 and W = 300 for Forest Trace dataset;
M = 5, R = 500 and W = 800 for image dataset I; M = 5, k = 10 (R ≈ 3.3277), and W = 10 for image dataset II)
the forest dataset, r < 300 for the image dataset I and r < 5 for the image dataset II). The results are consistent

Table 5 gives the communication costs of running Scheme-I and Scheme-II with different datasets. The experimental results show that, while the communication cost of Scheme-II is larger than that of Scheme-I (by a factor of roughly 1.25 to 2.5 in Table 5), the communication efficiency of both schemes is practically high. For search efficiency, the search time in Scheme-II is two orders of magnitude smaller than that in Scheme-I.

[Figure: panels (a) Forest Trace dataset (W = 300), (b) Forest Trace dataset (W = 300), (c) Image dataset I (W = 800), (d) Image dataset I (W = 800).]

6 CONCLUSION
We investigated the problem of privacy-preserving similarity search over encrypted feature-rich data. We proposed a high-speed and compact similarity search index supporting efficient file and index updates. Based on well-defined security models with leakages, we proved that our index constructions are semantically secure against adaptively chosen query attack. Theoretical performance analysis was also presented to carefully characterize our index designs. Utilizing three different representative real-world datasets,