Second Revision
Abstract
This guide introduces the NetApp A-SIS deduplication technology and describes in detail how to
implement and utilize it.
It should prove useful for customers requiring assistance in understanding and architecting solutions
with A-SIS deduplication and NetApp storage systems.
Table of Contents
1 Introduction
1.1 Intended Audience
1.2 Purpose
1.3 Prerequisites and Assumptions
1.4 Document Conventions
2 Overview
2.1 NetApp Deduplication Technologies
2.1.1 SnapVault for NetBackup™
2.1.2 A-SIS Deduplication
2.2 Dense Volumes
2.3 A-SIS Features and Functions
2.3.1 General A-SIS Operational Considerations
3 Configuration and Operation
3.1 Requirements Overview
3.2 Installing and Licensing A-SIS
3.2.1 A-SIS Licensing in a Clustered Environment
3.3 Command Summary
3.4 A-SIS Quick Start Guide
3.5 Monitoring A-SIS Status
3.6 End-to-End A-SIS Configuration Example
3.7 Configuring A-SIS Schedules
4 Operating Characteristics
4.1 A-SIS Target Environment
4.2 A-SIS Performance
4.3 A-SIS Storage Savings
4.4 Additional A-SIS Considerations
4.4.1 Number of A-SIS Processes
4.4.2 A-SIS and Active/Active Configuration
4.4.3 A-SIS and Space Savings on Existing Data
4.4.4 A-SIS Best Practices
1 Introduction
1.2 Purpose
The purpose of this paper is to present a guide for implementing NetApp A-SIS deduplication. It will
address step-by-step configuration examples, introduce known caveats and recommendations to
assist the reader in designing optimal solutions, and prepare the audience for performing
deployments of the technology in customer environments.
Its use is threefold:
Provide detailed information to all interested parties.
Educate prior to performing deployments.
Serve as a reference for resolving issues that could arise.
This document is not:
A sales guide (although some high-level thoughts are covered in the “Solutions Overview”
section)
A competitive comparison
A complete product design document
2 Overview
This section provides a quick overview of deduplication in general and then introduces what A-SIS
deduplication is and how it works at a high level.
While all these technologies offer the benefit of reducing the amount of required storage, in the
marketplace they are often not considered “deduplication” technologies when compared to solutions
offered by other vendors. That sentiment, while not entirely accurate, is understood, and NetApp
continues to expand its portfolio with several technologies for further deduplication of data. The
following subsections cover two of the solutions that are available as of the writing of this paper;
additional deduplication technologies are coming in both the short term and the more distant future.
Before delving into technical solutions, it makes sense to understand the value of deduplication to
customers. The primary advantage of data deduplication is that it conserves physical disk space
when storing data on disk. The average UNIX® or Windows® disk volume contains thousands of
duplicate data strings. Traditionally, when copies of these volumes are created, every duplicate data
string is also copied, resulting in an inefficient use of secondary storage. Deduplication helps to
remove this inefficiency and yields a more effective cost per gigabyte in the data center.
To keep track of the many indirect blocks (“IND” in Figure 2) that are pointing to it, each data block
has a block count reference kept in the volume metadata. As additional indirect blocks point to it or
existing ones stop pointing to it, this value is incremented or decremented accordingly. When no
indirect blocks point to a data block, it is released.
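The reference counting described above can be illustrated with a minimal Python sketch. The class and method names (BlockRefCounts, add_reference, drop_reference) are hypothetical and are not Data ONTAP interfaces; the sketch only models the increment, decrement, and release-at-zero behavior.

```python
class BlockRefCounts:
    """Toy model of per-block reference counts (not a Data ONTAP API)."""

    def __init__(self):
        self.counts = {}    # block id -> number of indirect blocks pointing to it
        self.released = []  # block ids whose space has been released

    def add_reference(self, block_id):
        # An additional indirect block now points to this data block.
        self.counts[block_id] = self.counts.get(block_id, 0) + 1

    def drop_reference(self, block_id):
        # An existing indirect block stopped pointing to this data block.
        self.counts[block_id] -= 1
        if self.counts[block_id] == 0:
            # No indirect blocks point to the block any more: release it.
            del self.counts[block_id]
            self.released.append(block_id)


refs = BlockRefCounts()
refs.add_reference("blk-7")   # first indirect block points to blk-7
refs.add_reference("blk-7")   # a second file now shares the same block
refs.drop_reference("blk-7")  # one file is deleted; the block is still in use
refs.drop_reference("blk-7")  # the last pointer is gone; the block is released
```

The block is released only after the second drop_reference call, which is why shared blocks survive the deletion of any single file that points to them.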
A-SIS uses dense volume technology to allow duplicate blocks anywhere in the flexible volume to be
deleted.
Essentially, A-SIS only stores unique blocks in the flexible volume and creates a small amount of
additional metadata in the process. Notable features include:
NetApp A-SIS deduplication operates with a high degree of granularity, at the block level.
It operates on the active file system of the flexible volume. Snapshot copies created after
running A-SIS enjoy the same storage savings benefits.
A-SIS is a background process that can be configured to run automatically, scheduled, or run
manually through the command-line interface.
A-SIS is application transparent and therefore can be used for deduplication of data
originating from anywhere in the data center.
A-SIS is enabled and managed using a simple command-line interface.
A-SIS can also be enabled on flexible volumes that already contain data, and it can deduplicate the existing blocks.
The remainder of this document goes into great detail on the operation of A-SIS, but in general the
following occurs:
Newly saved data on the NearStore is stored in blocks as usual by Data ONTAP. Each block
of data has a digital fingerprint, which is compared to all other fingerprints in the flexible
volume. If two fingerprints are found to be the same, a byte-for-byte comparison is done of all
bytes in the block, and, if there is an exact match between the new block and the existing
block on the flexible volume, the duplicate block is discarded and its disk space is reclaimed.
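The fingerprint-then-verify flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not the A-SIS implementation: SHA-256 stands in for the fingerprint (the actual fingerprint algorithm is not described here), the 4096-byte block size is arbitrary, and the function name deduplicate is hypothetical.

```python
import hashlib


def deduplicate(blocks):
    """Return (unique_blocks, block_map) for a list of bytes objects.

    block_map[i] is the index in unique_blocks that logical block i points to.
    """
    by_fingerprint = {}  # fingerprint -> indexes into unique_blocks
    unique_blocks = []
    block_map = []
    for data in blocks:
        fp = hashlib.sha256(data).digest()  # stand-in fingerprint (assumption)
        match = None
        for idx in by_fingerprint.get(fp, []):
            # Fingerprints are equal: confirm with a byte-for-byte comparison.
            if unique_blocks[idx] == data:
                match = idx
                break
        if match is None:
            # No exact match exists, so this block is stored as a unique block.
            unique_blocks.append(data)
            match = len(unique_blocks) - 1
            by_fingerprint.setdefault(fp, []).append(match)
        block_map.append(match)
    return unique_blocks, block_map


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
unique, mapping = deduplicate(blocks)
print(len(unique), mapping)
```

The byte-for-byte check is what makes the approach safe: a fingerprint match alone only nominates a candidate, and a block is shared only when its contents are identical.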
sis status [-l] <vol>
    Returns current status of A-SIS for the specified flexible volume.
    The -l option causes a long listing to be displayed.
df -s <vol>
    Returns the value of A-SIS space savings in the active file system
    for the specified flexible volume.
sis off <vol>
    Deactivates A-SIS on the flexible volume specified. This means there
    will be no more change logging or deduplication operations, but the
    flexible volume will remain a dense volume, and the storage savings
    will be kept.
    If this command is used, and then A-SIS is turned back on for this
    flexible volume, the flexible volume will need to be rescanned with
    the "sis start -s" command.
sis check <vol>
    Verifies and updates the fingerprint database for the specified
    flexible volume and includes purging stale fingerprints.
sis stat <vol>
    Displays the statistics of flexible volumes that have A-SIS enabled.
Create, Modify, Delete Schedules (if not doing manually)
    Delete or modify the default A-SIS schedule that was configured when
    A-SIS was first enabled on the flexible volume or create desired
    schedule.
    sis config [-s sched] <vol>
Manually Run A-SIS (if not using schedules)
    sis start <vol>
Monitor Status of A-SIS
    sis status <vol>
Below, from the sis man page, you see the various State, Status, and Progress messages that can
be returned when running sis status. Note that if you don’t provide a flexible volume name, the
status for all flexible volumes that have A-SIS enabled will be displayed.
toaster> sis status
Path State Status Progress
/vol/dvol_1 Enabled Idle Idle for 10:45:23
/vol/dvol_2 Enabled Pending Idle for 15:23:41
/vol/dvol_3 Disabled Idle Idle for 37:12:34
/vol/dvol_4 Enabled Active 25 GB Scanned
/vol/dvol_5 Enabled Active 25 MB Searched
/vol/dvol_6 Enabled Active 40 MB (20%) Done
/vol/dvol_7 Enabled Active 30 MB Verified
/vol/dvol_8 Enabled Active 10% Merged
And following is a textual description of the meaning for each flexible volume:
dvol_1 is Idle. The last A-SIS operation on the flexible volume was finished 10:45:23 ago.
dvol_2 is Pending for resource limitation. The A-SIS operation on the flexible volume will
become Active when the resource is available.
dvol_3 is Idle because the A-SIS operation is disabled on the flexible volume.
dvol_4 is Active. The A-SIS operation is scanning the whole flexible volume (initiated
with “sis start -s”). So far, it has scanned 25GB of data.
dvol_5 is Active. The operation is searching for duplicate data, and 25MB of data has already
been searched.
dvol_6 is also Active. The operation has saved 40MB of data. This is 20% of the total
duplicate data found in the searching stage.
dvol_7 is Active. It is verifying the metadata of processed data blocks. This process will
remove unused metadata.
dvol_8 is Active. Verified metadata are being merged. This process will merge together all
verified metadata of processed data blocks to an internal format that supports fast sis
operation.
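A monitoring script could turn the sis status listing into structured records. The following Python sketch assumes the output has been captured as text in the four-column layout shown above; the function name parse_sis_status is hypothetical, and the sample is an excerpt of the listing above.

```python
def parse_sis_status(text):
    """Parse captured `sis status` output into a list of dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        # Split into at most four fields so the free-form Progress column
        # (for example "Idle for 10:45:23") stays in one piece.
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] == "Path":
            continue  # skip the header line and anything malformed
        path, state, status, progress = fields
        rows.append({"path": path, "state": state,
                     "status": status, "progress": progress})
    return rows


SAMPLE = """\
Path State Status Progress
/vol/dvol_1 Enabled Idle Idle for 10:45:23
/vol/dvol_2 Enabled Pending Idle for 15:23:41
/vol/dvol_4 Enabled Active 25 GB Scanned
"""

rows = parse_sis_status(SAMPLE)
active = [r["path"] for r in rows if r["status"] == "Active"]
print(active)
```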
The general flow of the phases A-SIS goes through and the correlating sis status messages
when actively running on a flexible volume are shown in Figure 4.
For additional information, the -l option will display detailed status, as shown below.
toaster> sis status -l /vol/dvol_6
Path: /vol/dvol_6
State: Enabled
Status: Active
Progress: 41020 KB (20%) Done
Type: Regular
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
Last Operation Error: -
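The long listing is a set of "Field: value" lines, so it is straightforward to load into a dictionary, for example to compute how long the last operation ran. The Python sketch below assumes the output format shown above; the function names are hypothetical, and the time zone abbreviation is dropped before parsing because abbreviations such as PST are not parsed portably.

```python
from datetime import datetime


def parse_sis_status_long(text):
    """Parse captured `sis status -l` output into a field -> value dict."""
    info = {}
    for line in text.strip().splitlines():
        # Split on the first colon only; timestamps contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def parse_timestamp(value):
    """Parse e.g. 'Thu Mar 24 13:30:00 PST 2005', ignoring the time zone."""
    parts = value.split()
    del parts[4]  # drop the time zone token (assumed to be the fifth field)
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


SAMPLE = """\
Path: /vol/dvol_6
State: Enabled
Status: Active
Schedule: sun-sat@0
Last Operation Begin: Thu Mar 24 13:30:00 PST 2005
Last Operation End: Fri Mar 25 00:34:16 PST 2005
Last Operation Size: 4732932 KB
"""

info = parse_sis_status_long(SAMPLE)
duration = (parse_timestamp(info["Last Operation End"])
            - parse_timestamp(info["Last Operation Begin"]))
print(duration)
```

Because both timestamps are in the same time zone, discarding the zone does not affect the computed duration.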
1. Begin by creating a flexible volume (keeping in mind the maximum allowable volume size for the
platform, as specified in the requirements table at the beginning of this section).
r200-rtp01*> vol create VolPST aggr0 200g
Creation of volume 'VolPST' with size 200g on containing aggregate
'aggr0' has completed.
2. Now, as a best practice, we’ll disable scheduled Snapshot copies. An alternative to what’s shown
below would be to use the command “snap sched VolPST 0 0 0”.
r200-rtp01*> vol status VolPST
Volume State Status Options
VolPST online raid_dp, flex
Containing aggregate: 'aggr0'
r200-rtp01*> vol options VolPST nosnap true
r200-rtp01*> vol status VolPST
Volume State Status Options
VolPST online raid_dp, flex nosnap=on
Containing aggregate: 'aggr0'
3. Now we’ll enable A-SIS on the flexible volume and verify that it’s turned on. The vol status
command will show a sis attribute for flexible volumes that have A-SIS turned on. (It can be a bit
confusing, since sis is also indicated for those flexible volumes that have been written to by
SnapVault for NetBackup.)
Note that there needs to be space available in the flexible volume for the sis on command to
complete successfully. That is, if the sis on command were attempted on a flexible volume that
already had data and was completely full, it would fail (since there is no room to create the
required metadata).
Note that after turning A-SIS on, Data ONTAP lets you know that if this were an existing flexible
volume that already contained data prior to A-SIS being enabled, you would want to run sis
start -s; in this example it’s a brand-new flexible volume, so that’s not necessary.
4. Another way to verify that A-SIS is enabled on the flexible volume is to just check the output from
running sis status on the flexible volume.
r200-rtp01*> sis status /vol/VolPST
Path State Status Progress
/vol/VolPST Enabled Idle Idle for 00:00:20
5. Next we’ll turn off the default A-SIS schedule. Since in this example the administrators will be
moving large quantities of PST files in as time permits, we’ll want to let them run A-SIS manually
at opportune times.
r200-rtp01*> sis config /vol/VolPST
Path Schedule
/vol/VolPST sun-sat@0
r200-rtp01*> sis config -s - /vol/VolPST
r200-rtp01*> sis config /vol/VolPST
Path Schedule
/vol/VolPST -
At this point, in our example, the administrator NFS-mounted the flexible volume to /testPSTs on a
Solaris™ host, sunv240-rtp01, and copied a large number of PST files from users’ directories into the
new PST archive flexible volume. The result from the host perspective is shown below.
(Obviously the same sort of thing could be accomplished by mapping a CIFS share to a Windows
host.)
root@sunv240-rtp01 # pwd
/testPSTs
root@sunv240-rtp01 # df -k .
Filesystem kbytes used avail capacity Mounted on
r200-rtp01:/vol/VolPST
167772160 33388384 134383776 20% /testPSTs
The example continues with examining the flexible volume, running A-SIS deduplication, and
monitoring the status.
6. Use df -s to examine the storage consumed and the space savings provided. Note that no space
savings have been achieved by simply copying data to the flexible volume even though A-SIS is
turned on. What has happened is that all the blocks that have been written to this flexible volume
since A-SIS was turned on have had their fingerprints written to the change log file.
r200-rtp01*> df -s /vol/VolPST
Filesystem used saved %saved
/vol/VolPST/ 33388384 0 0%
7. Start A-SIS running on the flexible volume. This causes the change log to be processed,
fingerprints to be sorted and merged, and duplicate blocks to be found.
r200-rtp01*> sis start /vol/VolPST
The SIS operation for "/vol/VolPST" is started.
9. Once sis status indicates the flexible volume is once again in the Idle state, A-SIS has finished
running, and we can now check the space savings it provided in the flexible volume.
r200-rtp01*> df -s /vol/VolPST
Filesystem used saved %saved
/vol/VolPST/ 24072140 9316052 28%
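The %saved column can be reproduced from the other two columns. The Python sketch below assumes that the percentage is saved / (used + saved); that formula is an inference that matches the figures displayed in this example, not a documented definition, and the function name percent_saved is hypothetical.

```python
def percent_saved(used_kb, saved_kb):
    """Space savings as a percentage of the data that would otherwise be stored."""
    total = used_kb + saved_kb
    if total == 0:
        return 0.0  # an empty volume has no savings to report
    return 100.0 * saved_kb / total


before = percent_saved(33388384, 0)        # figures from step 6
after = percent_saved(24072140, 9316052)   # figures from step 9
print(f"{before:.0f}% {after:.0f}%")
```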
Run with no arguments, sis config will return the schedules for all flexible volumes that have A-
SIS enabled. The example below shows the four different formats the reported schedules can have.
toaster> sis config
Path Schedule
/vol/dvol_1 -
/vol/dvol_2 23@sun-fri
/vol/dvol_3 auto
/vol/dvol_4 sat@6
When the -s option is specified, the command will set up or modify the schedule on the specified
flexible volume. The schedule parameter can be specified in one of four ways:
[day_list][@hour_list]
[hour_list][@day_list]
-
auto
The day_list specifies which days of the week A-SIS should run. It is a comma-separated list of the
first three letters of the day: sun, mon, tue, wed, thu, fri, sat. The names are not case sensitive.
Day ranges such as mon-fri can also be given. The default day_list is sun-sat.
The hour_list specifies which hours of the day A-SIS should run on each scheduled day. The
hour_list is a comma-separated list of the integers from 0 to 23. Hour ranges such as 8-17 are
allowed.
Step values can be used in conjunction with ranges. For example, 0-23/2 means "every two hours."
The default hour_list is 0 (that is, midnight on the morning of each scheduled day).
If "-" is specified, there won't be a scheduled A-SIS operation on the flexible volume.
The “auto” schedule causes A-SIS to run on that flexible volume whenever there are 20% new
fingerprints in the change log. This check is done in a background process and occurs every minute.
When A-SIS is enabled on a flexible volume the first time, an initial schedule is assigned to the
flexible volume. This initial schedule is sun-sat@0, which means "once every day at midnight."
To configure the schedules shown earlier in this section, the following commands would be issued:
toaster> sis config -s - /vol/dvol_1
toaster> sis config -s 23@sun-fri /vol/dvol_2
toaster> sis config -s auto /vol/dvol_3
toaster> sis config -s sat@6 /vol/dvol_4
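To make the schedule grammar concrete, the following Python sketch expands a schedule string into explicit days and hours, applying the defaults described above (sun-sat and hour 0). It is an illustration of the documented format, not the parser used by Data ONTAP. The function names are hypothetical, and rejecting wrap-around ranges such as fri-mon is an assumption, since that case is not covered here.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]


def _expand_days(spec):
    days = set()
    for part in spec.lower().split(","):  # day names are not case sensitive
        first, _, last = part.partition("-")
        i = DAYS.index(first)
        j = DAYS.index(last) if last else i
        if i > j:
            raise ValueError(f"unsupported day range: {part}")  # assumption
        days.update(range(i, j + 1))
    return [DAYS[k] for k in sorted(days)]


def _expand_hours(spec):
    hours = set()
    for part in spec.split(","):
        rng, _, step = part.partition("/")  # optional step value, e.g. 0-23/2
        first, _, last = rng.partition("-")
        lo = int(first)
        hi = int(last) if last else lo
        if not 0 <= lo <= hi <= 23:
            raise ValueError(f"hours must be within 0-23: {part}")
        hours.update(range(lo, hi + 1, int(step) if step else 1))
    return sorted(hours)


def parse_schedule(sched):
    sched = sched.strip()
    if sched == "-":
        return {"mode": "none", "days": [], "hours": []}
    if sched.lower() == "auto":
        return {"mode": "auto", "days": [], "hours": []}
    day_spec = hour_spec = ""
    # Either list may come first; a piece starting with a letter is the day_list.
    for piece in sched.split("@"):
        if piece and piece[0].isalpha():
            day_spec = piece
        elif piece:
            hour_spec = piece
    return {
        "mode": "scheduled",
        "days": _expand_days(day_spec) if day_spec else list(DAYS),
        "hours": _expand_hours(hour_spec) if hour_spec else [0],
    }


print(parse_schedule("23@sun-fri"))
print(parse_schedule("sat@6"))
```

Both orderings shown in the sis config listing (23@sun-fri and sat@6) are handled by the same code path, since the sketch classifies each side of the "@" by its first character.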
4 Operating Characteristics
This section discusses where A-SIS makes sense and the behavior that you can expect.
If there is very little new data, run A-SIS infrequently, because it doesn't make sense to
unnecessarily consume CPU resources. How often you run it will depend on the change rate
of the data in the flexible volume.
The best options are:
Use the auto mode so that A-SIS only runs when significant additional data has
been written to each particular flexible volume (this will tend to naturally spread out
when A-SIS runs).
Stagger A-SIS schedules for the flexible volumes so it runs on alternate days.
Run A-SIS manually.
Run A-SIS before creating Snapshot copies, as this will ensure no undeduplicated data gets
locked in Snapshot copies. If a Snapshot copy is created on a flexible volume before A-SIS
has a chance to run/complete on that flexible volume, this could result in lower space
savings.
The Snapshot reserve should be greater than 0 if Snapshot copies are to be used. (An
exception to this might be in a SAN environment, where often it is set to zero for thin
provisioning of LUNs.)
There must be some free space in the flexible volume to allow A-SIS to operate and create
the metadata it requires. As necessary, flexible volumes can be resized, with no impact to
data access, to accommodate this.
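As one way to apply the staggering recommendation above, the Python sketch below generates sis config commands that place consecutive flexible volumes on alternating sets of days. The two day groups, the default hour, and the function name staggered_commands are illustrative assumptions; only the sis config -s syntax comes from this document.

```python
# Two alternating day groups (an illustrative choice, not a NetApp default).
DAY_GROUPS = ["mon,wed,fri", "tue,thu,sat"]


def staggered_commands(volumes, hour=0):
    """Build one `sis config -s` command per volume, alternating day groups."""
    commands = []
    for i, vol in enumerate(volumes):
        days = DAY_GROUPS[i % len(DAY_GROUPS)]
        commands.append(f"sis config -s {days}@{hour} {vol}")
    return commands


cmds = staggered_commands(["/vol/dvol_1", "/vol/dvol_2", "/vol/dvol_3"])
for cmd in cmds:
    print(cmd)
```

The generated commands are only printed, so they can be reviewed before being run on the storage system.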
5.1 Licensing
Make sure A-SIS is properly licensed and, if the platform is not an R200, make sure the NearStore
option is also properly licensed:
fas3070-rtp01*> license
…
a_sis <license>
nearstore_option <license>
…
Also note that there needs to be free space available in the flexible volume for the “sis on”
command to complete successfully. If a flexible volume is full, A-SIS will not run. However, as noted
earlier, flexible volumes can be resized with no impact to data access to accommodate this.
Note that the undo option of the sis command is only available in the diag mode, accessed using
the command “priv set diag”.
Note that if sis undo starts processing and there is not enough space to undeduplicate, it will
stop, report a message about insufficient space, and leave the flexible volume dense. All data
is still accessible, but some block sharing is still occurring. Use “df -s” to see how much free
space you really have and then either grow the flexible volume or delete data or Snapshot copies to
provide the needed free space.
© 2007 Network Appliance, Inc. All rights reserved. Specifications subject to change without notice. NetApp, the Network Appliance logo, Data
ONTAP, FlexClone, FlexVol, NearStore, SnapMirror, SnapVault, and WAFL are registered trademarks and Network Appliance and Snapshot are
trademarks of Network Appliance, Inc. in the U.S. and other countries. Solaris is a trademark of Sun Microsystems, Inc. Windows and Microsoft are
registered trademarks of Microsoft Corporation. UNIX is a registered trademark of The Open Group. NetBackup is a trademark of Symantec
Corporation or its affiliates in the U.S. and other countries. All other brands or products are trademarks or registered trademarks of their respective
holders and should be treated as such.