
2010 International Conference on Future Information Technology and Management Engineering

Data Deduplication Techniques


Qinlu He, Zhanhuai Li, Xiao Zhang
Department of Computer Science
Northwestern Polytechnical University
Xi'an, P.R. China
luluhe8848@hotmail.com
Abstract-With the rapid development of information and network technology, data centers are growing quickly in size, and energy consumption accounts for a rising proportion of IT spending. Against this green-computing background, many companies are turning to green storage, hoping to reduce the energy consumed by their storage systems. Data de-duplication technology optimizes the storage system by greatly reducing the amount of data, thereby reducing energy consumption and heat emission. Data compression can reduce the number of disks used in operation and thus the cost of disk energy consumption. This paper studies data de-duplication strategies, processes, and implementations, laying the foundation for further work.
Keywords-cloud storage; green storage; data deduplication
I. INTRODUCTION
Green, energy-saving operation is being taken more and more seriously, especially while the international financial crisis has not fully cleared and cost control remains a major concern for companies. It is against this background that data de-duplication, as a green storage technology, has become a hot topic.
In a large-scale storage system environment, setting up a storage pool and sharing resources avoids each user having to reserve separate free storage space. Data de-duplication can significantly reduce the amount of backup data, reducing the required storage capacity, space and energy consumption [1]. The major storage vendors are launching related products or services. For example, the data deduplication solution (Diligent) used in the IBM System Storage TS7650G ProtecTIER can reduce the demand for physical storage devices to as little as 1/25. By reducing the demand for storage hardware, it reduces overall costs and energy consumption. The NetApp V Series supports redundant data removal and can reduce the amount of data by at least 35%. At present, redundant data deletion technology is mostly applied to secondary storage for archiving and backup, deleting extra copies when the same data is copied multiple times. The future trend will be to remove duplicate data in real time in primary storage systems.
II. DATA DEDUPLICATION
Deduplication[2][3] is very simple: when repeated data is encountered, it is not saved again; instead, an index pointing to the first (and only) copy of the data is added. Deduplication is not new; in fact, it is a derivative of data compression. Data compression deletes duplicate data within a single file, replacing it with an index pointing to the first occurrence. Deduplication extends this concept as follows:

• Within a single file (the same as data compression)
• Across documents
• Across applications
• Across clients
• Across time
Figure 1. Where deduplication can happen (from within a single file up to the data center).
Deduplication is a data reduction technique, commonly used in disk-based backup systems, designed to reduce the amount of storage capacity used. It works across different time periods to find duplicate variable-size data blocks in files at different locations, and replaces the duplicate data blocks with indicators. Highly redundant data sets[4][5] (such as backup data) benefit greatly from data de-duplication technology; users can achieve reduction ratios of 10:1 to 50:1. Moreover, data deduplication technology allows users to replicate backup data between different sites efficiently and economically.
Compression uses compression algorithms to eliminate redundant data contained within a single document in order to reduce file size, whereas deduplication uses algorithms to eliminate identical files or data blocks spread across the storage system.
Data deduplication technology is different from normal compression[13][14]. Given two identical files, data compression removes the repeated data within each file and replaces it with an index pointing to the first occurrence; deduplication, by contrast, recognizes that the two documents are identical and therefore saves only the first file. In addition, like data compression, it removes the duplicate data within that first file, further reducing the size of the stored data.
Figure 2. Data deduplication[6]: three largely identical 6 MB documents (18 MB in total without de-duplication) occupy only about 7 MB after de-duplication.
Deduplication is also different from normal incremental backup. The thrust of incremental backup is to back up only newly generated data, whereas the key of data de-duplication technology is to retain only a single instance of each piece of data, which makes it more effective at reducing the amount of stored data. Most manufacturers claim that their data deduplication products can reduce data to 1/20 of the normal capacity. The basic principle of data de-duplication technology is to filter the data blocks, find identical blocks, and replace them with pointers to the single stored instance.
Figure 3. Data deduplication: new unique data segments are stored, while repeated data segments are replaced with pointers to the unique data segment.
Basically, it reduces the storage space occupied by the data. This brings the following benefits[12]:

• IT cost savings (no investment needed in additional storage space)
• Smaller backup data and data snapshots (saving cost and time)
• Lower power demand (fewer disks, fewer tapes, etc.)
• Savings in network bandwidth (because less data is transferred)
• Time savings
• Because less storage space is needed, disk-based backup becomes feasible
Backup equipment is always filled with a great deal of redundant data. To solve this problem and save more space, "deduplication" technology naturally becomes the focus of attention. Using "data deduplication" technology, the original data can be reduced to 1/20, so there is more backup space: not only can backup data be kept on disk longer, but a large amount of the bandwidth required for off-site storage can also be saved.
Ratio = Bytes In / Bytes Out

Space savings (%) = (Bytes In - Bytes Out) / Bytes In

Ratio = 1 / (1 - Space savings)
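As a worked example using the figures from Figure 2 (18 MB in, 7 MB out after deduplication): Ratio = 18 / 7 ≈ 2.6:1, and Space savings = (18 - 7) / 18 ≈ 61%; equivalently, Ratio = 1 / (1 - 0.61) ≈ 2.6.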
III. DATA DEDUPLICATION STRATEGY
Data de-duplication technology identifies duplicate data, eliminates redundancy, and reduces the overall capacity of data that needs to be transferred or stored [7][8]. Deduplication detects duplicate data elements by judging whether a file, block or bit is the same as another file, block or bit. Data de-duplication technology processes each data element with a mathematical "hash" algorithm and obtains a unique code called a hash authentication number. Each number is compiled into a list, often referred to as a hash index.

At present there are mainly file-level, block-level and byte-level deletion strategies, each of which can be used to optimize storage capacity.
A. File-level data deduplication strategy

File-level deduplication is often referred to as Single Instance Storage (SIS)[9]: the attributes of the files to be backed up or archived are checked against those already stored in the index. If the file is not already stored, it is stored and the index is updated; otherwise, only a pointer to the existing file is stored. Thus, only one instance of any given file is saved, and all other copies are replaced with a "stub" that points to the original file.
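As an illustration of this idea, the following is a minimal sketch (not the authors' implementation) of file-level single-instance storage in Python; it assumes a SHA-1 hash of the whole file content is used as the index key and keeps everything in memory:

```python
import hashlib

class SingleInstanceStore:
    """Minimal sketch of file-level deduplication (Single Instance Storage)."""

    def __init__(self):
        self.index = {}    # content hash -> stored object id
        self.stubs = {}    # file path -> stored object id (the "stub" pointer)
        self.objects = {}  # stored object id -> file content

    def backup(self, path):
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha1(data).hexdigest()
        if digest not in self.index:
            # First (and only) instance of this content is actually stored.
            object_id = len(self.objects)
            self.objects[object_id] = data
            self.index[digest] = object_id
        # Every copy of the file is represented by a stub pointing to it.
        self.stubs[path] = self.index[digest]

    def restore(self, path):
        return self.objects[self.stubs[path]]
```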
B. Block-level data deduplication technology

Block-level data deduplication technology[10][11] divides the data stream into blocks and checks, for each block, whether an identical block has been seen before (usually by running a hash algorithm on each data block to form a digital signature or unique identifier). If the block is unique, it is written to disk and its identifier is stored in the index; otherwise, only a pointer to the original location of the identical block is stored. This method replaces duplicate data blocks with small pointers rather than storing the duplicate blocks again, thus saving disk storage space. Using a hash algorithm to judge duplicate data may lead to hash collisions; hash algorithms such as MD5 and SHA-1 are used to check the data blocks and form a unique code. Although collisions and the resulting data corruption are possible, they are highly unlikely.
Figure 5. Logical structure of the data de-duplication strategy
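The following is a minimal sketch of the block-level approach described above, assuming fixed-size blocks and SHA-1 fingerprints (real products typically use variable-size blocks, as discussed in Section V):

```python
import hashlib

BLOCK_SIZE = 4096  # fixed block size assumed for this sketch

def dedup_stream(data: bytes):
    """Split a data stream into fixed-size blocks and deduplicate them.

    Returns (unique_blocks, recipe): unique_blocks maps each SHA-1 digest to
    the single stored copy of that block; recipe is the ordered list of
    digests (pointers) needed to reassemble the original stream.
    """
    unique_blocks = {}   # digest -> block data (the single stored instance)
    recipe = []          # pointers that stand in for duplicate blocks
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in unique_blocks:
            unique_blocks[digest] = block   # unique block is written out
        recipe.append(digest)               # duplicates become pointers only
    return unique_blocks, recipe

def reassemble(unique_blocks, recipe):
    """Follow the pointers to reconstruct the original data stream."""
    return b"".join(unique_blocks[digest] for digest in recipe)
```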
1) Cases where block-level technology is more efficient than file-level technology:

With file-level technology, an internal change to a file causes the entire file to be stored again. A PPT or similar file may need only a simple content change, such as updating a page to show a new report or new dates, yet this leads to re-storing the whole document. Block-level data de-duplication technology stores only one version of the document plus the parts that change between versions. File-level technology generally achieves compression ratios of less than 5:1, while block-level technology can compress the data capacity by 20:1 or even 50:1.
2) Cases where file-level technology is more efficient than block-level technology:

With file-level data de-duplication technology, the index is very small and judging whether data is repeated takes very little computing time, so the removal process has little impact on backup performance. Because the index is small and consulted relatively infrequently, the processing load required by file-level deduplication is low. It also has less impact on recovery time: block-level technology must use the primary index to match blocks and follow data block pointers to "reassemble" the data, whereas file-level technology stores each unique document plus pointers to that file, so little reconstruction is needed.
C. Byte-level data deduplication

Analyzing data at the byte-stream level is another way to perform data de-duplication. The new data stream is compared byte by byte with the data streams already stored, achieving higher accuracy.
Products using byte-level technology are usually able to "identify the content"; in other words, the vendor reverse-engineers the data stream produced by the backup process to learn how to retrieve the file name, file type, date/time stamp and other information[14].
This method can reduce the computational load when determining duplicate data. One caveat: it usually operates in a post-processing stage, after the backup is complete, when the backup data is examined for duplicates. The entire disk of data must therefore be backed up first, and enough disk cache must be available to perform the data deduplication process[15]. Moreover, the deduplication process may be limited to the backup data streams within one backup set, rather than being applied across backup groups.
Once the deduplication process is complete, byte-level technology can reclaim disk space. Before the space is reclaimed, a consistency check should be performed to ensure that, after the duplicate data has been deleted, the original data can still be recovered. The last full backup is retained, so the recovery process does not need to rely on reconstructed data, which speeds up recovery.
Figure 6. Logical structure of the byte-level data de-duplication strategy
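A minimal sketch of the post-processing flow described above: the backup is first written in full, duplicates are then located (here via a hash lookup for brevity) and confirmed byte by byte, and only then is the redundant space reclaimed. The names and structure are illustrative assumptions, not a vendor's implementation:

```python
import hashlib

def post_process_dedup(backup_area: dict):
    """backup_area maps object names to the raw bytes already written to disk.

    Returns (store, recipes): store holds one copy of each unique object,
    recipes maps each name to the digest of the copy it points to.
    """
    store, recipes = {}, {}
    for name, data in backup_area.items():
        digest = hashlib.sha1(data).hexdigest()
        if digest in store:
            # Consistency check: confirm byte by byte before freeing space.
            assert store[digest] == data, "hash collision detected"
        else:
            store[digest] = data
        recipes[name] = digest
    # Only now is it safe to release the redundant copies in backup_area.
    for name in recipes:
        backup_area[name] = None   # placeholder for "space reclaimed"
    return store, recipes
```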
IV. DATA DEDUPLICATION PROCESS
The basic process of deleting duplicate data consists of five stages[3][8][12][15]:

The first stage is data collection: by comparing the new backup with the old backup data, the scope of the data to be examined is narrowed.

The second stage is data identification, in which the data objects marked as similar during the collection stage are compared byte by byte. If the first stage produced a work sheet of data requiring identification, a specific algorithm must be used to determine which data in the backup group is unique and which is repeated. If the first stage identified, at the metadata level, data in the backup group that is the same as in the previous backup, then the identification stage compares that data byte by byte.

The third stage is data re-assembly: new data is saved, and the data marked as duplicate in the previous stage is replaced with pointers to the saved data. The end result of this process is a view of the backup group after the duplicates have been removed.

The fourth stage performs a data integrity check before all the duplicate data is actually removed.

Finally, the redundantly stored data is removed, releasing the previously occupied disk space for other uses.
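The five stages can be strung together as in the following sketch, which works on a dictionary of {name: bytes} objects and uses fixed-size blocks; it is an illustration of the process under those assumptions, not a production implementation:

```python
import hashlib

def deduplicate_backup(new_backup, stored_blocks, block_size=4096):
    """new_backup maps object names to bytes; stored_blocks maps SHA-1 digests
    to blocks kept from previous backups and is updated in place."""
    # Stage 1: data collection - split the new backup into candidate blocks,
    # narrowing the scope to block-sized pieces that may repeat.
    candidates = []
    for name, data in new_backup.items():
        for off in range(0, len(data), block_size):
            candidates.append((name, off, data[off:off + block_size]))

    # Stage 2: identification - decide which candidate blocks are unique and
    # which repeat an already-stored block (hash lookup plus byte comparison).
    unique = []
    for name, off, block in candidates:
        digest = hashlib.sha1(block).hexdigest()
        if not (digest in stored_blocks and stored_blocks[digest] == block):
            unique.append((digest, block))

    # Stage 3: re-assembly - save the new blocks and build the view in which
    # every object is just an ordered list of pointers (digests).
    for digest, block in unique:
        stored_blocks[digest] = block
    view = {}
    for name, data in new_backup.items():
        view[name] = [hashlib.sha1(data[off:off + block_size]).hexdigest()
                      for off in range(0, len(data), block_size)]

    # Stage 4: integrity check - every object must be reconstructible from the
    # stored blocks before any redundant copy is discarded.
    for name, pointers in view.items():
        restored = b"".join(stored_blocks[d] for d in pointers)
        assert restored == new_backup[name]

    # Stage 5: the redundant copies are no longer needed; only stored_blocks
    # and the pointer view have to be kept, freeing the space they occupied.
    return view
```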
V. IMPLEMENTATIONS
According to where the deduplication takes place relative to the business systems, implementations can be divided into two types: foreground processing and background processing.

Foreground processing is implemented purely in software. The software itself is backup software with a de-duplication function and uses a Client/Server structure. The Server defines the policies and initiates a backup at the appointed time, scheduling the Client to divide the backup data into blocks, run a hash operation on each block, and record the results in a hash table. A block whose hash value already exists in the table is deleted; a block with a different result is recorded and its data saved[12][14]. For data recovery, the data blocks referenced by the values in the table are used to restore the deleted data and reassemble the system. The advantage of this implementation is that the de-duplication is performed on the business hosts before the data is sent over the network, reducing both network traffic and the amount of storage space required.
Figure 7. Schematic of the foreground processing implementation (application servers and the disaster recovery site)
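The following minimal sketch illustrates the foreground (client-side) scheme described above, under the assumption that the client first asks the server which block hashes it does not yet hold and then transfers only those blocks; the class and function names are illustrative:

```python
import hashlib

class DedupServer:
    """Keeps the hash table and the single stored copy of each block."""
    def __init__(self):
        self.blocks = {}                        # digest -> block data

    def missing(self, digests):
        """Return the digests the server has not stored yet."""
        return [d for d in digests if d not in self.blocks]

    def store(self, blocks_by_digest):
        self.blocks.update(blocks_by_digest)

def client_backup(data: bytes, server: DedupServer, block_size=4096):
    """Source-side deduplication: hash locally, send only unknown blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    digests = [hashlib.sha1(b).hexdigest() for b in blocks]
    needed = set(server.missing(digests))       # only these cross the network
    server.store({d: b for d, b in zip(digests, blocks) if d in needed})
    return digests                              # recipe kept for recovery
```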
Background processing is implemented with integrated software and hardware appliances. Its overall advantage is that the de-duplication uses the appliance's own CPU and memory, so the operation of the business applications is not affected. According to where the de-duplication takes place inside the integrated appliance, it is further divided into In-line and Post-processing implementations.
In-line implementations are also based on a hash algorithm. Data written to the device is held in memory in a de-duplication buffer. A variable-length block partition function first scans the data and analyzes the split points that will produce the maximum repetition rate; it then splits the data to form variable-size data blocks, which are de-duplicated according to the hash-based principle described above[12][14][15].
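A minimal sketch of the variable-size block partitioning mentioned above, assuming a simple content-defined chunking scheme (a rolling hash over a sliding window, with a cut whenever the low bits of the hash match a mask); the parameters and the rolling-hash construction are illustrative assumptions, not a particular vendor's algorithm:

```python
import hashlib

BASE, MOD = 257, 1 << 32
WINDOW = 48                         # sliding-window size in bytes (assumed)
MASK = 0x1FFF                       # on random data, a cut roughly every 8 KiB
MIN_CHUNK, MAX_CHUNK = 2048, 65536  # bounds on the variable block size
POW = pow(BASE, WINDOW - 1, MOD)    # factor used to drop the oldest byte

def split_variable_chunks(data: bytes):
    """Content-defined chunking: cut points depend on the data itself, so an
    insertion early in a file shifts only nearby block boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD   # slide the window
        h = (h * BASE + byte) % MOD                  # fold in the newest byte
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                  # final partial chunk
    return chunks

def chunk_digests(data: bytes):
    """Hash each variable-size block; identical blocks yield identical digests."""
    return [hashlib.sha1(c).hexdigest() for c in split_variable_chunks(data)]
```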
Post-processing implementations use either hash-algorithm technology or differential-algorithm technology[12][13]. The biggest difference from the other implementations is that the data is not processed as it is written: it is saved directly to the integrated storage device, and the de-duplication operation is carried out afterwards. Scanning in units of 1 byte can find the maximum amount of duplicate data and therefore provides the maximum deletion ratio; across dozens of different data types, ratios of up to 1000:1 can be achieved, far beyond ordinary compression utilities. Because the most recent data is preserved intact, the authenticity of the user's data copy can be guaranteed where rules and regulations require it, meeting compliance requirements.
VI. CONCLUSION AND FUTURE WORK
With the rapid development of information and network technology, data centers are rapidly growing in size, and energy consumption accounts for an increasing proportion of IT spending. Data deduplication optimizes the storage system and can greatly reduce the amount of data, thereby reducing energy consumption and heat emission. Data compression can reduce the number of disks used in operation and thus the cost of disk energy consumption. Removing duplicate data gives the backup systems of large data centers a comprehensive, mature, safe, reliable and greener backup data storage technology solution, and it has very high application value and important academic research value.
ACKNOWLEDGMENT
This work was supported by a grant from the National High Technology Research and Development Program of China (863 Program) (No. 2009AA01A404).
REFERENCES
[1] McKnight J, Asaro T, Babineau B. Digital Archiving: End-User Survey and Market Forecast 2006-2010. 2006. http://www.enterprisestrategygroup.com/ESGPublications/ReportDetail.asp?ReportID=591
[2] FalconStor Software, Inc. Demystifying Data Deduplication: Choosing the Best Solution. White Paper, 2009-10-14, 1-4. http://www.pexpo.co.uk/content/download/20646/353747/file/DemystifyingDataDedupe_WP.pdf
[3] Mark W. Storer, Kevin Greenan, Darrell D. E. Long, Ethan L. Miller. Secure Data Deduplication. StorageSS'08, October 31, 2008, Fairfax, Virginia, USA, 2008, 1-10.
[4] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19:1-16, 2007.
[5] Y. Wang and S. Madnick. The Inter-Database Instance Identification Problem in Integrating Autonomous Systems. In Proceedings of the Fifth International Conference on Data Engineering, pages 46-55, Washington, DC, USA, 1989. IEEE Computer Society.
[6] http://www.linux-mag.com/id/7535
[7] Medha Bhadkamkar, Jorge Guerra, Luis Useche, Sam Burnett, Jason Liptak, Raju Rangaswami, and Vagelis Hristidis. BORG: Block-reORGanization for Self-optimizing Storage Systems. In Proc. of the USENIX Conference on File and Storage Technologies, February 2009.
[8] Austin Clements, Irfan Ahmad, Murali Vilayannur, and Jinyuan Li. Decentralized deduplication in SAN cluster file systems. In Proc. of the USENIX Annual Technical Conference, June 2009.
[9] Bolosky WJ, Corbin S, Goebel D, Douceur JR. Single instance storage in Windows 2000. In: Proc. of the 4th USENIX Windows Systems Symp. Berkeley: USENIX Association, 2000. 13-24.
[10] Jorge Guerra, Luis Useche, Medha Bhadkamkar, Ricardo Koller, and Raju Rangaswami. The Case for Active Block Layer Extensions. ACM Operating Systems Review, 42(6), October 2008.
[11] Austin Clements, Irfan Ahmad, Murali Vilayannur, and Jinyuan Li. Decentralized deduplication in SAN cluster file systems. In Proc. of the USENIX Annual Technical Conference, June 2009.
[12] http://www.snia.org/search?cx=001200299847728093177%3A3rwmjfdm8ae&cof=FORID%3A11&q=data+deduplication&sa=Go#994
[13] http://bbs.chinabyte.com/thread-393434-1-1.html
[14] http://storage.chinaunix.net/stor/c/
[15] Austin Clements, Irfan Ahmad, Murali Vilayannur, and Jinyuan Li. Decentralized deduplication in SAN cluster file systems. In Proc. of the USENIX Annual Technical Conference, June 2009.
