Copyright 2010 EMC Corporation. Do not Copy - All Rights Reserved.

The objectives for this module are shown here. Please take a moment to read them.

EMC Data Domain - 1

The objectives for this lesson are shown here. Please take a moment to review them.

EMC Data Domain - 2

Shown in the slide is a Data Domain deployment. A Data Domain system is a storage system
that deduplicates data on arrival. It has shelves of disks and a controller. It is highly
optimized, first for backup and second for archive applications, and it supports most of the
industry-leading backup and archive applications.
Data Domain integrates easily with an existing backup or archive environment. This includes
not only EMC's own offering, NetWorker, but also Symantec, CommVault, and so on.
Data can be transferred into the Data Domain storage system over either Ethernet or Fibre
Channel. Over Ethernet it can use NAS protocols such as NFS or CIFS, or optimized
protocols such as OpenStorage, a custom API used with Symantec NetBackup.
Data is deduplicated as it is stored. Once stored, it can be replicated for disaster recovery;
only the compressed, deduplicated unique data segments that were filtered out during the write
process are replicated to the target tier.

EMC Data Domain - 3

A typical backup environment without Data Domain involves writing backup data to tape. In
order to protect against disasters, the tapes must be shipped offsite. This is an expensive and
labor-intensive task.

EMC Data Domain - 4

When Data Domain is implemented in a backup environment, data is written to disk instead of
tape. Disk provides faster performance than tape and has other characteristics that provide
protection. Data Domain deduplicates data, which reduces the size of the data footprint.
Instead of physically shipping tapes to remote warehouses, data can be transferred across the
network to a remote Data Domain system.

EMC Data Domain - 5

A Data Domain appliance is a controller with its own disk array. The controller handles
deduplication and the other processing the system requires. It runs its own operating system,
the Data Domain Operating System (DD OS). Dual controllers are available to provide redundancy.

EMC Data Domain - 6

Shown in the slide is the Data Domain family and details on their specifications.

Refer to the following link for the latest information on Data Domain models:
http://www.datadomain.com/images/products/Appliances-Table.jpg

EMC Data Domain - 7

Components under high mechanical or electrical stress are protected by N+1 redundancy.
This means that each such component has at least one extra, independent backup component
that can take over should a primary component fail. As shown in the picture, extra fans and
power supplies are included. RAID 6 protects against dual disk drive failures.

EMC Data Domain - 8

One of the most common approaches to deduplication competing with Data Domain is what is
known as post-process deduplication. In this architecture, data is stored to disk before
deduplication; after it is stored, it is read back internally, deduplicated, and written again
to a different area.
Although this approach may sound appealing, seeming as if it would allow for faster backups
and the use of fewer resources, it actually creates problems.
First, more disk is needed to store both the raw data temporarily and the deduplicated data.
Second, post-process deduplication has an impact on speed because post-process systems
are usually spindle-bound. There are typically three or four times more disks in a post-process
configuration than you will see in a Data Domain deployment.

An inline approach is also much simpler. If all data is filtered before it is stored to disk, then
the system behaves just like a regular storage system: it just writes data and it just reads
data. There is no separate administration involved in managing multiple pools, some with
deduplication and some with regular storage, or in managing the boundary conditions between
them. Less administration in a storage system is always better. So, by being simpler and
smaller to provision, an inline approach, and especially a CPU-centric inline approach, will
always be more attractive.

EMC Data Domain - 9

Within a Data Domain system, there are several levels of logical data abstraction above the
physical disk storage. Protocol namespaces, such as virtual tape libraries, EMC Data Domain
Boost, and CIFS/NFS shares, act as the external interface to applications. A single Data Domain
system may use any combination of these for storing and accessing data.
Files and directories for the namespaces are stored in the Data Domain file system.
Non-CIFS/NFS data is stored under special directories.
The unique segment collection is a collection of deduplicated data. It is here that sub-file
objects (segments) of about 8 KB are identified and deduplicated. Identical segments are
stored only once.
The last layer is the physical disk. Deduplicated data is stored on SATA disk drives and is
protected by RAID 6.

EMC Data Domain - 10

Stream-Informed Segment Layout (SISL) is the way that Data Domain approaches deduplication.
It performs deduplication in a highly efficient manner. Instead of being disk-based, SISL uses
a CPU-centric method that reduces the number of times the disks need to be accessed.
In order to identify segments quickly, data is stored along with a fingerprint that represents
the data segment.
The Summary Vector is a data structure held in RAM. It is used to identify unique segments of
data. Almost all such segments are identified through the Summary Vector, which saves the
system from doing a lookup in the on-disk index.
The Data Domain system stores neighboring segments of data together in units called segment
localities. These are held close together on disk so that consecutive data segments can be
accessed in a single disk access.
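
The role of the Summary Vector can be illustrated with a small in-memory membership filter. The sketch below assumes a Bloom-filter-style bit array and SHA-1 fingerprints; the class name, the sizes, and the hashing scheme are illustrative assumptions and are not taken from DD OS.

```python
import hashlib


class SummaryVector:
    """Toy in-RAM filter: answers 'definitely new' or 'possibly seen before'."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint):
        # Derive several bit positions from one fingerprint by salting it.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        # False means the segment is certainly new, so no disk lookup is needed.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


def fingerprint(segment):
    """Fingerprint a segment (SHA-1 is an assumption made for this sketch)."""
    return hashlib.sha1(segment).digest()
```

A filter of this kind can return a false "possibly seen" answer but never a false "new" answer, which is why a negative result can safely skip the on-disk index.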

EMC Data Domain - 11

This slide shows how data is written to the Data Domain system using the SISL process. First,
data is stored in non-volatile RAM. Here it is broken into segments, and a fingerprint is
created for each segment. The fingerprint for each segment is compared to the Summary Vector.
If there are no matches, the segment and consecutive segments are compared to multiple
segments on disk. If the segment is unique, it is stored on disk.
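
The write path described above can be sketched in a few lines. This is a simplified model under stated assumptions: segments are a fixed 8 KB (an assumption made for brevity), fingerprints are SHA-1, and the Summary Vector and on-disk comparisons are collapsed into a single dictionary lookup. The `DedupStore` class is hypothetical.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumption: fixed 8 KB segments for simplicity


class DedupStore:
    def __init__(self):
        self.segments = {}   # fingerprint -> segment bytes (stands in for disk)
        self.files = {}      # file name -> ordered list of fingerprints

    def write(self, name, data):
        """Store data under name; return the number of new segments written."""
        new_segments = 0
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fp = hashlib.sha1(segment).hexdigest()
            if fp not in self.segments:      # unique segment: store it once
                self.segments[fp] = segment
                new_segments += 1
            recipe.append(fp)                # duplicates only add a reference
        self.files[name] = recipe
        return new_segments

    def read(self, name):
        """Rebuild a file from its list of segment fingerprints."""
        return b"".join(self.segments[fp] for fp in self.files[name])
```

Writing the same data a second time stores no new segments; only the per-file list of fingerprints grows.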

EMC Data Domain - 12

Data is compressed in order to further reduce the capacity needed. This is done during the write
process. Compression options are shown on the slide.

EMC Data Domain - 13

Data Domain is designed using the Data Invulnerability Architecture (DIA). DIA provides data
integrity and recoverability within the Data Domain system. Since data is deduplicated, a single
segment of data may be used across multiple files. If this segment were to become corrupted,
multiple files could become corrupt. This makes it crucial to ensure that data is intact.
There are four aspects of DIA. End-to-end verification is the process of ensuring that data has
been written correctly. After data is written to the system, it is checked against the original
data to make sure it was written correctly.
Fault avoidance and containment ensures that data already on disk is not overwritten or
corrupted. This is accomplished using a special file system that does not overwrite old data.
Continuous fault detection and healing is a proactive process that continuously watches for
failures. RAID 6 and checksums are used to implement this.
Snapshots are used to provide file system recoverability. This protects against software and
hardware failure.
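
The idea behind end-to-end verification (write, read back, compare against the original) can be shown with a short sketch. It assumes a dictionary-like `storage` object and SHA-256 checksums; both are illustrative choices, and the function name `write_verified` is hypothetical.

```python
import hashlib


def write_verified(storage, key, data):
    """Write data, read it back, and confirm it matches the original."""
    expected = hashlib.sha256(data).hexdigest()
    storage[key] = bytes(data)               # the write
    read_back = storage[key]                 # re-read what was actually stored
    actual = hashlib.sha256(read_back).hexdigest()
    if actual != expected:
        raise IOError(f"verification failed for {key!r}")
    return expected                          # checksum kept for later checks
```

The returned checksum could be stored alongside the segment so that later background checks can detect corruption that occurs after the initial write.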

EMC Data Domain - 14

A snapshot is a read-only copy of backup data. A snapshot is useful for saving a copy of a
directory at a specific point in time, which can later be used as a restore point. The snapshot
feature creates an image of the Data Domain file system. This protects against both human and
system errors.

EMC Data Domain - 15

As an appliance, the Data Domain system automates all routine maintenance tasks. One of the
most important automated processes is the filesystem cleaning operation that must be scheduled
to reclaim physical storage occupied by deleted objects.
When application software expires backup or archive images, they are deleted in the sense that
they are no longer accessible or available for recovery from the application. However, the
images still occupy physical storage. Only a clean operation reclaims the segments used by files
that are deleted and are no longer referenced.
Cleaning can require a lot of system resources while it is running. Mechanisms are in place to
automatically lower the priority assigned to cleaning tasks in favor of more time-critical
processing tasks. Cleaning schedules are adjustable. By default, cleaning is scheduled to start
every Tuesday at 6:00 a.m.

EMC Data Domain - 16

Cleaning provides the opportunity to reorganize the data to improve the speed and efficiency of
deduplication.
Data invulnerability requires that data is always only written into new containers, and this
requirement also applies to the cleaning process. Copy forward segments are segments that for
read efficiency should be stored adjacent to each other and so they are copied forward together
into a single container.
Dead segments are dead because the files that referred to them have all been deleted and the
pointers have been removed. Dead segments are not overwritten with new data, since this
could put valid data at risk of corruption. Instead, valid segments are copied forward
into free containers to group the remaining valid segments together. When the data is safe and
reorganized, the original containers are returned to the available disk space.
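
The copy-forward behavior of cleaning can be sketched as follows. The model is deliberately small: a container is a dictionary of fingerprint to segment, and `container_capacity` is a made-up parameter. The function name and data layout are assumptions for illustration only.

```python
def clean(containers, live_fingerprints, container_capacity=4):
    """Copy live segments forward into new containers; never overwrite in place.

    containers: list of dicts mapping fingerprint -> segment bytes.
    live_fingerprints: fingerprints still referenced by at least one file.
    Returns (new_containers, reclaimed_count).
    """
    new_containers = []
    current = {}
    reclaimed = 0
    for container in containers:
        for fp, segment in container.items():
            if fp not in live_fingerprints:
                reclaimed += 1               # dead segment: simply not copied
                continue
            current[fp] = segment            # live segment: copy forward
            if len(current) == container_capacity:
                new_containers.append(current)
                current = {}
    if current:
        new_containers.append(current)
    # The old containers are left untouched; the caller releases them only
    # after the copy has completed.
    return new_containers, reclaimed
```

Because the old containers are never modified, an interruption during cleaning leaves the original data intact.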

EMC Data Domain - 17

Administrators need to understand how to configure and monitor the reports and logs for error
conditions. Data Domain systems provide access to the following types of reports and logs that
provide information about error conditions:
Autosupports and Alerts can be sent by email. Autosupport sends a daily email to Data Domain
Support containing various log files and other system information. This allows Data Domain
Support to quickly be informed of any issues that may arise in the Data Domain system.
Syslog can be configured to publish logs, alerts, and messages. SNMP can also be configured to
send a subset of alerts as traps to third-party SNMP managers.

EMC Data Domain - 18

The autosupport email list is used in two ways: to send a detailed daily report on a specified
schedule, and to send a daily alerts summary about non-critical hardware situations and disk
space usage numbers that should be addressed soon.
The autosupport command can also be used to send the output of a specific command or the
contents of a file to the distribution list.
By default, Data Domain systems send daily autosupport reports to Data Domain tech-support
via email using SMTP. The autosupport report contains system configuration information, alerts
summaries, performance statistics and system messages.
By default, Data Domain systems are also configured to send daily alerts to the autosupport list
that notify Data Domain tech-support about non-critical error messages or warnings about
problems on the system that should be fixed as soon as possible.
Customers have the option to configure who receives autosupports and alerts and the time they
are sent.
For how to configure autosupports and alerts, see the autosupport and alerts command
descriptions and options in the DD OS Command Reference Guide.

EMC Data Domain - 19

Alerts are sent with either a Warning or Critical severity. For example, a Warning alert is sent
when a fan fails. When alerted, customer support contacts the owner to arrange a replacement.
Warning alerts are sent when a non-critical system problem is detected. This type of problem
should be fixed as soon as possible. The warning is sent to the autosupport email list as soon as
the problem occurs. Warnings are also included in the Daily Alert Summary and with the
Autosupport Summary.
Critical alerts are sent when a severe problem occurs that should be fixed immediately. They are
sent to the alerts email list as soon as the problem occurs.

EMC Data Domain - 20

The objective for this lesson is shown here. Please take a moment to review it.

EMC Data Domain - 21

Replication is used to protect against disaster. This is accomplished by sending data from one
Data Domain system to another over the network. In a Data Domain system, only unique data is
replicated. This is made possible by the deduplication process. It saves enormous amounts of
bandwidth, since only a small portion of the stored data will have changed. Because less
data is transferred, the replication window is reduced.
There are three types of replication. These will be discussed on the following slides.
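
The bandwidth saving from replicating only unique data can be illustrated with a small calculation. The sketch assumes the source knows which fingerprints the target already holds; the function name and data shapes are hypothetical.

```python
def plan_replication(source_segments, target_fingerprints):
    """Return the segments the target lacks and the fraction of bytes saved.

    source_segments: dict of fingerprint -> compressed segment bytes.
    target_fingerprints: set of fingerprints already on the target.
    """
    to_send = {fp: seg for fp, seg in source_segments.items()
               if fp not in target_fingerprints}
    total = sum(len(seg) for seg in source_segments.values())
    sent = sum(len(seg) for seg in to_send.values())
    saved = 1 - sent / total if total else 0.0
    return to_send, saved
```

For example, if the target already holds two of three segments and the missing segment accounts for half of the bytes, only that segment is sent and half of the transfer is avoided.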

EMC Data Domain - 22

Collection replication is the transfer of all backup data. The data is replicated along with
everything needed for backup and recovery functions, so data at the target is accessible
immediately. In addition to data, user accounts and passwords are also replicated, as are
snapshots. Only a one-to-one configuration is allowed.

EMC Data Domain - 23

Directory replication is the transfer of individual directories on the Data Domain system. A Data
Domain system can be a source or destination for multiple directories and can also be a source
for some directories and a destination for others. Many topologies are supported with directory
replication. Normal backup and restore operations are still able to be performed during
replication.

EMC Data Domain - 24

Pool replication is a type of directory replication that replicates directories that contain VTL
tape cartridges. Virtual tape libraries use a structure called storage pools within the Data
Domain system. The data that is written to virtual tape can be replicated. Only one VTL license
is required, for the source.

EMC Data Domain - 25

One way to send data to the Data Domain system is through the use of CIFS or NFS shares.
CIFS can be used by Windows clients, while NFS is used by UNIX-based operating systems. A
directory within the /backup directory is shared out to the client. When data is sent to the
shared directory, it is deduplicated and stored automatically.

EMC Data Domain - 26

OpenStorage (OST) server software, which is a feature of Symantec's Veritas NetBackup,
integrates NetBackup with Data Domain system disk backup devices. It allows NetBackup media
servers to communicate with disk devices without emulating tape. To enable OST, a plugin must
be installed on the NetBackup media server so that it can integrate with Data Domain.
The Data Domain system then creates Logical Storage Units, which are used as NetBackup
storage servers.

EMC Data Domain - 27

Using the Data Domain VTL feature, backup applications can connect to and manage a Data
Domain system as if it were a tape library. In this configuration, Data Domain creates virtual
tape devices that act like real SCSI tape drives. Tapes and pools can be replicated to other
Data Domain systems for disaster recovery. Tapes can also be locked with retention to prevent
premature deletion. The VTL feature can be used simultaneously with the other interfaces.

EMC Data Domain - 28

Data Domain Boost is an option that distributes part of the deduplication process out of the Data
Domain system and onto the backup server. This makes the backup network more efficient, it
makes Data Domain systems 50% faster, and it makes the whole aggregate system more
manageable. It works across the entire Data Domain product line.
As shown in the diagram on the slide, segmentation, identification, and compression are
handled on the backup server instead of on the Data Domain system. This means that only the
unique segments are sent over the network.
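
The division of labor described above can be sketched as a two-step exchange: the backup server segments and fingerprints the data, asks the storage system which fingerprints are missing, and then compresses and sends only those. This is not the DD Boost API; the class, function, and method names below are hypothetical, and fixed-size segments, SHA-1, and zlib are assumptions.

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024  # assumption: fixed-size segments


class StorageServer:
    """Stands in for the storage system in this sketch."""

    def __init__(self):
        self.segments = {}

    def missing(self, fingerprints):
        """Tell the client which fingerprints are not stored yet."""
        return [fp for fp in fingerprints if fp not in self.segments]

    def store(self, fp, compressed):
        self.segments[fp] = compressed


def boost_backup(server, data):
    """Segment, fingerprint, and compress on the client; send only new segments."""
    segments = {}
    for offset in range(0, len(data), SEGMENT_SIZE):
        chunk = data[offset:offset + SEGMENT_SIZE]
        segments[hashlib.sha1(chunk).hexdigest()] = chunk
    bytes_sent = 0
    for fp in server.missing(list(segments)):
        payload = zlib.compress(segments[fp])
        server.store(fp, payload)
        bytes_sent += len(payload)
    return bytes_sent
```

In this model a repeated backup of unchanged data sends no segment payload at all; only the fingerprint query crosses the network.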

EMC Data Domain - 29

The Data Domain Retention Lock licensed software feature enables organizations to protect
records in non-writeable and non-erasable formats for a specified length of time, up to 70 years.
This means that although the protected data can be read, it cannot be modified or deleted until
the retention period has expired. This protects against accidents and user errors, as well as
malicious activity. For example, a Data Domain system may be used to store email records. A
malicious person may attempt to delete some incriminating emails but would be unable to do
so if the retention period has not expired.
Retention minimums and maximums can be set globally for the Data Domain system. For
example, it can be configured so that all files must have a retention of at least 5 years. Retention
values can be set on a file by file basis.
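
The retention rules described above (global minimum and maximum, per-file retention, reads allowed, deletes blocked until expiry) can be modeled in a few lines. This is an illustrative sketch only; the `RetentionLockedStore` class and its methods are hypothetical and do not reflect DD OS commands.

```python
from datetime import datetime, timedelta


class RetentionLockedStore:
    def __init__(self, min_retention, max_retention):
        self.min_retention = min_retention   # global minimum (timedelta)
        self.max_retention = max_retention   # global maximum (timedelta)
        self.files = {}                      # name -> (data, expiry datetime)

    def write(self, name, data, retention, now):
        """Store a file with its own retention period, within global limits."""
        if not self.min_retention <= retention <= self.max_retention:
            raise ValueError("retention period outside the allowed range")
        self.files[name] = (data, now + retention)

    def read(self, name):
        return self.files[name][0]           # reads are always allowed

    def delete(self, name, now):
        expiry = self.files[name][1]
        if now < expiry:
            raise PermissionError(f"{name} is locked until {expiry}")
        del self.files[name]
```

With a global minimum of five years, an attempt to store a file with a one-year retention is rejected, and a delete request one day after writing raises an error while reads continue to work.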

EMC Data Domain - 30

With the sanitize function, deleted files can be overwritten using a DoD/NIST compliant
algorithm and procedures.
No complex setup or disruption is needed. Sanitizing is the electronic equivalent of data
shredding; it removes any trace of deleted files.
This feature is designed primarily to support the needs of organizations that are required to
remove and destroy confidential data if it was accidentally written to an unapproved system or
to delete data that is no longer required.

See the Electronic Data Shredding technical brief at
http://www.datadomain.com/pdf/TechBrief-ElectronicShredding.pdf

EMC Data Domain - 31

With the Data Domain Encryption licensed software option enabled, all incoming data is encrypted
inline before it is written to disk. This is also referred to as encryption at rest. This improves
security by preventing data from being read directly from disk without being first decrypted by
the system. Data Domain implements software-based encryption, so no additional hardware is
required. Encryption is transparent to the access protocols. Because of this, no change is needed
in configuring the rest of the environment to deploy encryption at rest.

EMC Data Domain - 32

The objective for this lesson is shown here. Please take a moment to review it.

EMC Data Domain - 33

Data Domain systems can replace both the large staging disk and the tape system. Replication
across the WAN is built into the Data Domain systems instead of requiring a separately managed
function of the primary storage. Configuration of the backup software such as the Oracle
Recovery Manager (RMAN) does not need to be changed; simply point the backup application
at the Data Domain storage as a replacement for the previous NFS, CIFS, or VTL device.
Copies of the data needed for longer term archive or compliance can continue to be written to
tape either onsite or at the offsite disaster recovery site.

EMC Data Domain - 34

Backup and recovery for a Microsoft Exchange Server environment is a mission critical function
that benefits from all of the advantages of replacing tape based systems with Data Domain
appliances. In addition to being storage for the typical Exchange backups, Data Domain systems
can also be used as an efficient storage repository for email archiving applications. Instead of
email archives being stored on a separate system, the archives can be written to the same Data
Domain system that is storing the Exchange database backups.
The significant amount of duplicate data found in both the Exchange backups and in email
archive files is deduplicated across both data sets, to reduce the storage footprint even more.
Without Data Domain, different interface or file protocol support needs of the Exchange backup
server and the email archive server may have prevented these from backing up to the same
device. Being able to use CIFS, NFS, and VTL simultaneously to access a single Data Domain
system opens up many new possibilities for combining data from different sources to take
advantage of the savings from deduplication.

EMC Data Domain - 35

VMware sites tend to create more data to manage and protect than their physical counterparts.
Making it simpler to multiply servers tends to increase the storage footprints. The operational
flexibility offered by being able to have multiple copies and variants of a virtual image with
various configurations comes at the expense of needing to buy more storage to back up and
protect these images. Since many of the elements are the same between virtual images, they tend
to deduplicate very well when stored on Data Domain systems. Deploying a system at the
disaster recovery site allows for replication of critical VM images that can be kept up to date and
ready to assume operation immediately in a disaster.
Data Domain systems are attached to the high capacity backbone network used for storing and
moving the VM images. Installation and configuration is similar whether the system is being
used with VMware infrastructure, third party enterprise backup software, or specialized
VMware backup applications.

EMC Data Domain - 36

In this example, a nearline implementation is used to handle some version control software that
is using a Data Domain system as storage. The software tracks changes to documents as they are
being updated. Since file differences are usually minor, the opportunity for deduplication is
large. Data does not need to be accessed frequently, but needs to be immediately available for
the times that it is accessed.

EMC Data Domain - 37

Data Domain is also useful in an archive situation. This example stores mostly static files. Files
are not read back frequently but access to files needs to be immediate. This example uses a CIFS
share to implement the solution.

EMC Data Domain - 38

These are the key points covered in this module. Please take a moment to review them.

EMC Data Domain - 39
