1 INTRODUCTION
WITH the explosive growth of digital data, deduplication techniques are widely employed to back up data and to minimize network and storage overhead by detecting and eliminating redundancy among data. Instead of keeping multiple data copies with the same content, deduplication eliminates redundant data by keeping only one physical copy and referring other redundant data to that copy. Deduplication has received much attention from both academia and industry because it can greatly improve storage utilization and save storage space, especially for applications with a high deduplication ratio, such as archival storage systems.
A number of deduplication systems have been proposed based on various deduplication strategies, such as client-side or server-side deduplication and file-level or block-level deduplication. A brief review is given in Section 7. In particular, with the advent of cloud storage, data deduplication techniques have become more attractive and critical for cloud storage services. However, previous deduplication
systems have only been considered in a single-server setting. Yet many deduplication and cloud storage systems are relied on by users and applications for high reliability, especially archival storage systems in which data are critical and must be preserved over long time periods. This requires that deduplication storage systems provide reliability comparable to other highly available systems.
Furthermore, the challenge of data privacy also arises as more and more sensitive data are outsourced by users to the cloud. Encryption mechanisms have usually been utilized to protect confidentiality before outsourcing data to the cloud. However, most commercial storage service providers are reluctant to apply encryption over the data because it makes deduplication impossible. The reason is that traditional encryption mechanisms, including public-key encryption and symmetric-key encryption, require different users to encrypt their data with their own keys. As a result, identical data copies of different users will lead to different ciphertexts. To solve the problems of confidentiality and deduplication, the notion of convergent encryption [4] has been proposed and widely adopted to enforce data confidentiality while realizing deduplication. However, these systems achieve confidentiality of outsourced data at the cost of decreased error resilience. Therefore, how to protect both confidentiality and reliability while achieving deduplication in a cloud storage system remains a challenge.
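To make the idea concrete, the following minimal sketch illustrates convergent encryption under common assumptions: the key is derived by hashing the plaintext, encryption is deterministic AES-CTR with a fixed counter block, and the tag is derived by hashing the ciphertext. It illustrates the general notion rather than the exact construction of [4]; the helper names are ours, and it requires the third-party Python "cryptography" package.

import hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def convergent_encrypt(data: bytes):
    # Key is derived from the plaintext itself, so identical plaintexts from
    # different users produce identical keys, ciphertexts, and tags.
    key = hashlib.sha256(data).digest()            # 32-byte AES-256 key
    nonce = b"\x00" * 16                           # fixed counter block: deterministic (illustrative only)
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    ciphertext = enc.update(data) + enc.finalize()
    tag = hashlib.sha256(ciphertext).digest()      # tag used for duplicate detection
    return key, ciphertext, tag

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    dec = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).decryptor()
    return dec.update(ciphertext) + dec.finalize()

Because the key depends only on the content, two users holding the same file derive the same ciphertext and tag, which is exactly what allows the server to deduplicate encrypted data.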
In other words, any of the servers can obtain shares of the data stored at the other servers with the same short value as a proof of ownership (PoW). Furthermore, tag consistency, which was first formalized in [5] to prevent the duplicate/ciphertext replacement attack, is considered in our protocol. In more detail, it prevents a user from uploading a maliciously generated ciphertext whose tag is the same as that of another, honestly generated ciphertext. To achieve this, a deterministic secret sharing method has been formalized and utilized. To our knowledge, no existing work on secure deduplication properly addresses the reliability and tag consistency problems in distributed storage systems.
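As a rough illustration of the tag-consistency check, simplified to a plain hash-based tag (the constructions in this paper realize the tag through a deterministic secret sharing method instead), the storage server can refuse any upload whose tag does not match the uploaded ciphertext:

import hashlib

store = {}  # tag -> ciphertext (hypothetical in-memory duplicate index)

def accept_upload(claimed_tag: bytes, ciphertext: bytes) -> bool:
    # Tag consistency: the tag must really be derived from the uploaded
    # ciphertext; otherwise a malicious user could register a bogus
    # ciphertext under the tag of an honest user's data.
    if hashlib.sha256(ciphertext).digest() != claimed_tag:
        return False                      # inconsistent tag: reject the upload
    if claimed_tag in store:
        return True                       # duplicate: deduplicate against the stored copy
    store[claimed_tag] = ciphertext       # first upload of this content: store it
    return True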
This paper makes the following contributions.

- Four new secure deduplication systems are proposed to provide efficient deduplication with high reliability for file-level and block-level deduplication, respectively. The secret splitting technique, instead of traditional encryption methods, is utilized to protect data confidentiality. Specifically, data are split into fragments by using secure secret sharing schemes and stored at different servers; a simplified sketch of this idea is given after this list. Our proposed constructions support both file-level and block-level deduplication.
- Security analysis demonstrates that the proposed deduplication systems are secure in terms of the definitions specified in the proposed security model. In more detail, confidentiality, reliability, and integrity can be achieved in our proposed systems. Two kinds of collusion attacks are considered in our solutions: the collusion attack on the data and the collusion attack against servers. In particular, the data remain secure even if the adversary controls a limited number of storage servers.
- We implement our deduplication systems using the Ramp secret sharing scheme (RSSS), which enables high reliability and confidentiality levels. Our evaluation results demonstrate that the new constructions are efficient and that their redundancies are optimized and comparable with those of other storage systems supporting the same level of reliability.
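The sketch below shows the secret-splitting idea in its simplest form: an (n, n) additive (XOR) sharing in which each server holds one fragment and all fragments are needed to reconstruct. The proposed constructions instead use an (n, k, r) ramp scheme, which also tolerates server failures, so this is only a conceptual simplification.

import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(data: bytes, n: int) -> list:
    # n - 1 uniformly random fragments plus one fragment that XORs to the data;
    # any n - 1 fragments together reveal nothing about the data.
    shares = [os.urandom(len(data)) for _ in range(n - 1)]
    last = data
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def reconstruct(shares: list) -> bytes:
    data = shares[0]
    for s in shares[1:]:
        data = xor_bytes(data, s)
    return data

# Each fragment would be stored at a different storage server (S-CSP).
fragments = split(b"example data block", 4)
assert reconstruct(fragments) == b"example data block"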
1.2 Organization
This paper is organized as follows. In Section 2, we present the system model and security requirements of deduplication. Our constructions are presented in Sections 3 and 4. The security analysis is given in Section 5. The implementation and evaluation are shown in Section 6, and related work is described in Section 7. Finally, we draw our conclusions in Section 8.
2 PROBLEM FORMULATION
ciphertext whose tag is the same as that of another, honestly generated ciphertext, the cloud storage server can detect this dishonest behavior. Thus, users do not need to worry that their data have been replaced and can no longer be decrypted. A message authentication check is run by the users to detect whether the downloaded and decrypted data are complete and uncorrupted. This security requirement is introduced to prevent insider attacks by the cloud storage service providers.
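A minimal sketch of such a client-side check, assuming an HMAC keyed by a user-held secret (the key handling and helper names are illustrative, not the exact construction of this paper):

import hashlib
import hmac

def make_auth_tag(user_key: bytes, plaintext: bytes) -> bytes:
    # Computed by the user at upload time and kept (or re-derived) locally.
    return hmac.new(user_key, plaintext, hashlib.sha256).digest()

def verify_download(user_key: bytes, plaintext: bytes, stored_tag: bytes) -> bool:
    # Run by the user after download and decryption; detects corrupted or
    # replaced data even when the storage provider itself is the adversary.
    return hmac.compare_digest(make_auth_tag(user_key, plaintext), stored_tag)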
Reliability. The security requirement of reliability in deduplication means that the storage system can provide fault tolerance by means of redundancy. In more detail, our system can tolerate the failure of a certain number of nodes. The system is required to detect and repair corrupted data and provide correct output to the users.
4 FURTHER ENHANCEMENT
5 SECURITY ANALYSIS
6 EXPERIMENT
We describe the implementation details of the proposed distributed deduplication systems in this section. The main tool for our new deduplication systems is the Ramp secret sharing scheme (RSSS) [7], [8]. The shares of a file are distributed across multiple cloud storage servers in a secure way.
The efficiency of the proposed distributed systems is mainly determined by the three parameters n, k, and r of the RSSS. In this experiment, we choose 4 KB as the default data block size, which has been widely adopted in block-level deduplication systems. We choose the hash function SHA-256, with an output size of 32 bytes. We implement the RSSS based on Jerasure Version 1.2 [13]. We choose an erasure code for the (n, k, r)-RSSS whose generator matrix is a Cauchy matrix [14] for the data
encoding and decoding. The storage blowup is determined by the parameters n, k, and r. In more detail, this value is n/(k - r) in theory.
All our experiments were performed on an Intel Xeon E5530 (2.40 GHz) server running a Linux 3.2.0-23-generic OS. In the deduplication systems, the (n, k, r)-RSSS has been used. For practical consideration, we test four cases:
case 1: r = 1, k = 2, and 3 <= n <= 8 (Fig. 3a);
case 2: r = 1, k = 3, and 4 <= n <= 8 (Fig. 3b);
case 3: r = 2, k = 3, and 4 <= n <= 8 (Fig. 3c);
case 4: r = 2, k = 4, and 5 <= n <= 8 (Fig. 3d).
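As a quick sanity check of the theoretical blowup n/(k - r), the following snippet evaluates one representative parameter point from each of the four cases above (the specific points are chosen only for illustration):

BLOCK_SIZE = 4096  # bytes, the default block size used in the experiments

def storage_blowup(n: int, k: int, r: int) -> float:
    # Theoretical blowup of the (n, k, r)-RSSS: each block is split into
    # k - r pieces (plus r random pieces) and encoded into n shares.
    return n / (k - r)

# One representative point per tested case.
for n, k, r in [(4, 2, 1), (6, 3, 1), (6, 3, 2), (6, 4, 2)]:
    share_size = BLOCK_SIZE / (k - r)
    print(f"(n,k,r)=({n},{k},{r}): share {share_size:.0f} B, "
          f"blowup {storage_blowup(n, k, r):.2f}x")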
As shown in Fig. 1, the encoding and decoding times of our deduplication systems for each block (per 4 KB data block) are always on the order of microseconds, and hence are negligible compared to the data transfer performance in an Internet setting. We can also observe that the encoding
time is higher than the decoding time. The reason for this
result is that the encoding operation always involves all n
shares, while the decoding operation only involves a subset
of k < n shares.
The performance of several basic modules in our constructions is tested in our experiment. First, the average time for computing a hash value with a 32-byte output from a 4 KB data block is 25.196 microseconds. The average time is 30 ms for computing a hash value of the same output length from a 4 MB file, which only needs to be computed by the user once for each file.
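For reference, a small sketch of how such hash timings can be measured; absolute numbers depend on the machine and will differ from the Xeon E5530 results reported above.

import hashlib
import os
import timeit

block = os.urandom(4 * 1024)            # 4 KB data block
big_file = os.urandom(4 * 1024 * 1024)  # 4 MB file buffer

t_block = timeit.timeit(lambda: hashlib.sha256(block).digest(), number=10000) / 10000
t_file = timeit.timeit(lambda: hashlib.sha256(big_file).digest(), number=100) / 100

print(f"4 KB block: {t_block * 1e6:.1f} us, 4 MB file: {t_file * 1e3:.1f} ms")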
Next, we focus on the evaluation with respect to some critical factors in the (n, k, r)-RSSS. First, we evaluate the relationship between the computation cost and the number of S-CSPs. The results are given in Fig. 2, which shows the encoding/decoding times versus the number of S-CSPs n. In this experiment, r is set to 2 and the reliability level n - k = 2 is also fixed. From Fig. 2, the encoding time
Fig. 1. The encoding and decoding time for different RSSS parameters.
7 RELATED WORK
8 CONCLUSIONS
ACKNOWLEDGMENTS
The authors would like to extend their sincere appreciation
to the Deanship of Scientific Research at King Saud University for its funding of this research through the Research Group Project no. RGP-VPP-318. This work was supported
by National Natural Science Foundation of China (Nos.
61472091, 61472083, U1405255, 61170080, 61272455 and
U1135004), 973 Program (No. 2014CB360501), Natural
Science Foundation of Guangdong Province (Grant
No. S2013010013671), Program for New Century Excellent
Talents in Fujian University (JA14067), Fok Ying Tung
Education Foundation (141065), the fund of the State Key
Laboratory of Integrated Services Networks (No. ISN15-02),
and the Australian Research Council (ARC) Projects
DP150103732, DP140103649, and LP140100816. Y. Xiang is
the corresponding author.
REFERENCES
[1]
[11] J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou, "Secure deduplication with efficient and reliable convergent key management," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 6, pp. 1615-1625, Jun. 2014.
[12] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Proofs of ownership in remote storage systems," in Proc. ACM Conf. Comput. Commun. Secur., 2011, pp. 491-500.
[13] J. S. Plank, S. Simmerman, and C. D. Schuman, "Jerasure: A library in C/C++ facilitating erasure coding for storage applications, Version 1.2," Univ. of Tennessee, TN, USA, Tech. Rep. CS-08-627, Aug. 2008.
[14] J. S. Plank and L. Xu, "Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications," in Proc. 5th IEEE Int. Symp. Netw. Comput. Appl., Jul. 2006, pp. 173-180.
[15] C. Liu, Y. Gu, L. Sun, B. Yan, and D. Wang, "R-ADMAD: High reliability provision for large-scale de-duplication archival storage systems," in Proc. 23rd Int. Conf. Supercomput., 2009, pp. 370-379.
[16] M. Li, C. Qin, P. P. C. Lee, and J. Li, "Convergent dispersal: Toward storage-efficient security in a cloud-of-clouds," in Proc. 6th USENIX Workshop Hot Topics Storage File Syst., 2014.
[17] P. Anderson and L. Zhang, "Fast and secure laptop backups with encrypted de-duplication," in Proc. USENIX 24th Int. Conf. Large Installation Syst. Admin., 2010.
[18] Z. Wilcox-O'Hearn and B. Warner, "Tahoe: The least-authority filesystem," in Proc. 4th ACM Int. Workshop Storage Security Survivability, 2008, pp. 21-26.
[19] A. Rahumed, H. C. H. Chen, Y. Tang, P. P. C. Lee, and J. C. S. Lui, "A secure cloud backup system with assured deletion and version control," in Proc. 3rd Int. Workshop Secur. Cloud Comput., 2011.
[20] M. W. Storer, K. Greenan, D. D. E. Long, and E. L. Miller, "Secure data deduplication," in Proc. 4th ACM Int. Workshop Storage Security Survivability, 2008, pp. 1-10.
[21] J. Stanek, A. Sorniotti, E. Androulaki, and L. Kencl, "A secure data deduplication scheme for cloud storage," Tech. Rep., 2013.
[22] D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side channels in cloud services: Deduplication in cloud storage," IEEE Secur. Privacy, vol. 8, no. 6, pp. 40-47, Nov./Dec. 2010.
[23] R. D. Pietro and A. Sorniotti, "Boosting efficiency and security in proof of ownership for deduplication," in Proc. ACM Symp. Inf., Comput. Commun. Secur., 2012, pp. 81-82.
[24] J. Xu, E.-C. Chang, and J. Zhou, "Weak leakage-resilient client-side deduplication of encrypted data in cloud storage," in Proc. 8th ACM SIGSAC Symp. Inform., Comput. Commun. Security, 2013, pp. 195-206.
[25] W. K. Ng, Y. Wen, and H. Zhu, "Private data deduplication protocols in cloud storage," in Proc. 27th Annu. ACM Symp. Appl. Comput., 2012, pp. 441-446.
[26] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, "Provable data possession at untrusted stores," in Proc. 14th ACM Conf. Comput. Commun. Secur., 2007, pp. 598-609. [Online]. Available: http://doi.acm.org/10.1145/1315245.1315318
[27] A. Juels and B. S. Kaliski, Jr., "Pors: Proofs of retrievability for large files," in Proc. 14th ACM Conf. Comput. Commun. Secur., 2007, pp. 584-597. [Online]. Available: http://doi.acm.org/10.1145/1315245.1315317
[28] H. Shacham and B. Waters, "Compact proofs of retrievability," in Proc. 14th Int. Conf. Theory Appl. Cryptol. Inform. Security: Adv. Cryptol., 2008, pp. 90-107.
Jin Li received the BS degree in mathematics from Southwest University in 2002 and the PhD degree in information security from Sun Yat-sen University in 2007. Currently, he works at Guangzhou University as a professor.
He has been selected as one of science and
technology new stars in Guangdong province.
His research interests include cloud computing
security and applied cryptography. He has
published more than 70 research papers in
refereed international conferences and journals. His work has been cited more than 2,400 times on Google Scholar. He has also served as the program chair or a program committee member for many international conferences.
Yang Xiang received the PhD degree in computer science from Deakin University, Australia.
He is currently a full professor at the School of
Information Technology, Deakin University. He is
the director of the Network Security and Computing Lab (NSCLab). His research interests include
network and system security, distributed systems, and networking. In particular, he is currently leading his team developing active defense
systems against large-scale distributed network
attacks. He is the chief investigator of several
projects in network and system security, funded by the Australian
Research Council (ARC). He has published more than 130 research
papers in many international journals and conferences, such as IEEE
Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Forensics and Security, and IEEE Journal on Selected Areas in Communications. Two of
his papers were selected as the featured articles in the April 2009 and
the July 2013 issues of the IEEE Transactions on Parallel and Distributed Systems. He has published two books, Software Similarity and
Classification (Springer) and Dynamic and Advanced Data Mining for
Progressing Technological Development (IGI-Global). He has served as
the program/general chair for many international conferences such as ICA3PP '12/'11, IEEE/IFIP EUC '11, IEEE TrustCom '13/'11, IEEE HPCC '10/'09, IEEE ICPADS '08, and NSS '11/'10/'09/'08/'07. He has been a PC member for more than 60 international conferences in distributed systems, networking, and security. He serves as an associate editor of
IEEE Transactions on Computers, IEEE Transactions on Parallel and
Distributed Systems, Security and Communication Networks (Wiley),
and the editor of Journal of Network and Computer Applications. He is
the coordinator, Asia, for the IEEE Computer Society Technical Committee on Distributed Processing (TCDP). He is a senior member of the IEEE.