
White Paper

Exploring Successful Approaches to Test Data Management

A White Paper by Bloor Research

Author: Philip Howard
Publish date: July 2012

TDM improves the quality and accuracy of testing and supports agile development where errors are caught early and fixed more quickly, resulting in fewer defects
Philip Howard


Executive summary
Test Data Management (TDM) is about the provisioning of data for non-production environments, especially for test purposes but also
for development, training, quality assurance,
demonstrations or other activities. Test data
has always been required to support application development and other environments
but, until the relatively recent advent of TDM,
this has been achieved in an ad hoc manner
rather than in any formalised or managed way.
The predominant technique has been copying
some production data or cloning entire production databases.
Copying data or cloning production databases
has a number of drawbacks. To begin with:
how do organisations manage costs? Copying
production data or cloning production databases is largely a manual process. Moreover, it requires a whole new set of hardware and software, along with the licences to match,
and must be duplicated across each testing
environment. As a result, you can easily end
up with multiple copies of the same database
across multiple projects in development.
Alternatively, you can have multiple development teams sharing the same test database,
but sharing means that there is often contention for resources, resulting in extended
delivery schedules. Indeed, getting access to
the right data at the right time can be a major
issue regardless of whether you share test
data or have multiple test databases. It takes
time to generate (copy or clone) new datasets
because database administrators, who typically complete this task, have other priorities.
Test teams often have to wait several days, or
even longer, to get test data. In agile environments, in particular, manual copying or cloning can slow development.
Unfortunately, there are additional challenges
with copying production data or cloning production databases. If the application processes any
sensitive data, such as personally identifiable
information (PII), personal health information
(PHI), credit card information, national identifiers or any company confidential information,
then it will be necessary to protect this data
so that developers or testers cannot see it. In
most cases, simply copying production data or
cloning the production database is not legally
viable since sensitive data is likely to be exposed. It is best to apply an appropriate data
protection technique such as data masking, which we will discuss later in this paper.

Another approach to test data creation is for
developers or testers to create their own.
This doesn't introduce the same security concerns as copying production data or cloning production databases, but there are two
main problems. First, it is difficult to ensure
that the test data accurately represents the
production data. Test data needs to accurately
reflect the relationships that exist between
different elements in the database, but it is
often the case that these relationships are not
explicitly defined. Unfortunately, relying on the
database schema to help build test data (which
is the normal approach) will not capture all the
relevant detail. Second, because it is a hassle
to keep creating new test datasets, fresh,
accurate test data is often unavailable even when the data model has changed. As a result,
accurate tests cannot be executed until later
in the development process when errors are
most expensive to correct. Also, agile development environments require test data to be as
nimble as the development process itself. This
is a problem when developers and testers have to create test data manually.
TDM introduces an automated way of generating appropriate test data that can support
agile environments and protect sensitive data.
While there is obviously a licensing cost for
TDM solutions, they increase testing agility,
reduce risk, ensure thorough testing against
an appropriately broad range of tests, decrease
hardware and software costs, provide compliance through data protection, and reduce
development cycles so that new applications
can be brought into production faster and with
fewer bugs.
Having outlined why TDM is important, we will next discuss what you need to think about when considering a TDM solution, along with best practices. We will then briefly consider IBM's approach to TDM (InfoSphere Optim
Test Data Management solution) and discuss
some real-world examples of its use across a
number of companies.


Pros and cons of typical test data creation approaches


There are essentially three ways to provision
test data:
1. Copy production databases and, where appropriate, de-identify any sensitive data
2. Subset production data and, where appropriate, de-identify any sensitive data
3. Generate synthetic data based on an understanding of the underlying data model,
which requires no data de-identification.
Each of these approaches has advantages
and disadvantages, and we will discuss each
in turn.
Database cloning
Copying production databases has the advantage of being relatively simple. However,
it is expensive in terms of hardware, licence
and support costs to have multiple copies
of the same database, and multiple copies
of the same database increase governance
concerns. It is not unusual for companies with
large development shops to have upwards of
twenty different copies of the same database
for development and testing.
Another issue with cloning is that it is time-consuming; large databases take a long time to copy, and test cases take longer to run because of large data volumes. Cloning doesn't promote agile development processes since testers and developers can't refresh data, and it is impossible to create targeted test
datasets for specific test cases or validate the
data after test runs. In addition, cloning does
not enable collaboration between database
administration and testing teams, and it is not
scalable across multiple data sources or applications. Finally, it is risky: cloning makes it more difficult to ensure that sensitive data is properly masked for compliance. A further downside is that, in 24x7 environments,
cloning a production database may have an
impact on performance. The same is true,
though to a lesser extent, for sub-setting.
Database sub-setting
Sub-setting your production databases is less
expensive. However, it suffers from the same
problems as all sampling processes in that it
can miss outliers. This is particularly important in development and testing environments
because outliers may cause the system to break whereas normal results do not. Therefore, it is important that outliers be properly
tested. If you are using sub-setting then you
need to ensure that outliers are captured and
represented within the sub-setting process.
This means using a solution that can capture
the full range of production data rather than
simply randomly picking some sub-set of the
data.
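To make the idea concrete, the following is a minimal Python sketch of outlier-aware sub-setting under simplified assumptions (rows held in memory, a single grouping key, an arbitrary threshold for what counts as "rare"); it illustrates the principle rather than how any particular TDM product works:

    import random
    from collections import defaultdict

    def representative_subset(rows, key, fraction=0.05, rare_threshold=10, seed=42):
        """Stratified subset: sample the common groups, but keep every row
        from rare groups so that outliers survive the cut."""
        rng = random.Random(seed)
        groups = defaultdict(list)
        for row in rows:
            groups[key(row)].append(row)
        subset = []
        for members in groups.values():
            if len(members) <= rare_threshold:
                subset.extend(members)  # rare case: keep all of it
            else:
                subset.extend(rng.sample(members, max(1, int(len(members) * fraction))))
        return subset

    # Example (hypothetical data): subset insurance claims while guaranteeing
    # that unusual claim types are still represented.
    # test_data = representative_subset(claims, key=lambda r: r["claim_type"])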
An issue that occurs with both full copies and subsets taken directly from production is that sensitive data may
be exposed. This sensitive data needs to be
protected as mandated by data privacy laws or
compliance requirements. De-identifying test
data is not a simple process and requires an
understanding of what data may be hidden.
You need to first discover where sensitive
data resides, classify and define datatypes,
and determine metrics and policies to ensure
protection over time. Data can be distributed over multiple applications, databases and
platforms with little documentation, and it is
often the case that relevant information is built
into application logic so that both implicit and
explicit relationships exist. This is a danger
because organisations may rely too heavily on system and application experts who are
familiar with the explicit relationships but not the implicit ones. In
practice, finding sensitive data and discovering
data relationships requires careful analysis.
Data sources and relationships should be
clearly understood and documented so no
sensitive data is left vulnerable. Only after
understanding the complete landscape can
you define proper enterprise data security and
privacy policies and, while this can be done
manually, it is an onerous process that is better automated.
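As a simple illustration of what automated discovery involves, the Python sketch below uses hypothetical regular-expression patterns to flag columns whose sampled values look like sensitive data; a real discovery tool combines pattern matching with metadata analysis and relationship inference rather than relying on regexes alone:

    import re

    # Hypothetical patterns for illustration only; real classifiers are far broader.
    PATTERNS = {
        "email":       re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "card_number": re.compile(r"^\d{13,19}$"),
        "uk_nino":     re.compile(r"^[A-Z]{2}\d{6}[A-Z]$"),
    }

    def classify_columns(column_samples, hit_rate=0.8):
        """Flag a column as sensitive if most of its sampled values match a known pattern."""
        findings = {}
        for column, values in column_samples.items():
            values = [v for v in values if v]
            for label, pattern in PATTERNS.items():
                if values and sum(bool(pattern.match(v)) for v in values) / len(values) >= hit_rate:
                    findings[column] = label
        return findings

    # Example: classify_columns({"contact": ["a@b.com", "c@d.org"], "ref": ["X1", "X2"]})
    # returns {"contact": "email"}.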
So how do you protect the data? Typically, TDM
solutions provide masking capabilities. How
the masking is accomplished is somewhat
dependent on why you are doing the masking.
For example, you could simply hide a credit
card number by replacing each digit with an x
(xxxx-xxxx-xxxx-xxxx). This approach is fine if
you are only concerned with data protection.
However, if you want to test a payment application then you will need to work with contextually accurate numbers. Similarly, simple
shuffling techniques (for example, replacing
zip code 12345 with 54321) will not work if your
application requires a valid zip code. For test
data management you will need to mask in
such a way that the data remains valid.
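The difference between simply hiding a value and masking it so that it remains valid can be shown with a short Python sketch (issuer prefix retained, check digit recomputed, and a hypothetical lookup of genuine postal codes). This is only an illustration of the principle, not a description of any particular product's masking routines:

    import random
    import re

    def luhn_check_digit(payload):
        """Compute the check digit that makes payload + digit pass the Luhn test."""
        total = 0
        for i, ch in enumerate(reversed(payload)):
            d = int(ch)
            if i % 2 == 0:  # digits nearest the (future) check digit are doubled
                d = d * 2 - 9 if d * 2 > 9 else d * 2
            total += d
        return str((10 - total % 10) % 10)

    def mask_card_number(card, rng):
        """Keep the 6-digit issuer prefix, randomise the account portion and
        recompute the check digit, so the masked value still validates."""
        digits = re.sub(r"\D", "", card)
        payload = digits[:6] + "".join(rng.choice("0123456789") for _ in range(len(digits) - 7))
        return payload + luhn_check_digit(payload)

    VALID_ZIPS = ["10001", "30301", "60601", "94105"]  # hypothetical list of real codes

    def mask_zip(zip_code, rng):
        """Swap in another genuinely valid code instead of shuffling digits."""
        return rng.choice([z for z in VALID_ZIPS if z != zip_code] or VALID_ZIPS)

    rng = random.Random(7)  # seeded so the masking run is repeatable
    print(mask_card_number("4111-1111-1111-1111", rng), mask_zip("12345", rng))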



Further, it is not simply a question of understanding what data needs to be masked and
then masking it. You need to ensure that data
relationships remain intact during the masking process or application testing may break
down. For example, suppose a patient is being treated
by a particular consulting physician for a particular disease at a specific hospital using a
designated operating theatre. If you scramble
the data so that a patient with the flu ends up
having open heart surgery then your software
may break down simply because your masking routines have not ensured that important
relationships remain intact. So, discovery of
these relationships is essential. Moreover,
this situation is exacerbated if your application is going to span multiple data sources. As
heterogeneous environments proliferate, this
is becoming more and more of an issue. You
need to understand relationships across databases, not just within a single database.
Keep in mind that masking is never perfect.
In healthcare environments, to continue the
preceding example, a determined hacker may
still be able to identify individuals, precisely
because of the need to retain relationships.
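A minimal Python sketch of one way to keep such relationships intact: substitute values deterministically (so the same input always maps to the same output) and replace related fields as a unit from a lookup of internally consistent combinations. The field names, the secret and the lookup table are all hypothetical; commercial masking tools implement this far more comprehensively:

    import hashlib

    # Hypothetical lookup of clinically consistent diagnosis/procedure pairs, so a
    # masked record never pairs "influenza" with "open heart surgery".
    CONSISTENT_PAIRS = [
        ("influenza", "antiviral therapy"),
        ("appendicitis", "appendectomy"),
        ("coronary artery disease", "bypass surgery"),
    ]

    def deterministic_pick(value, secret, modulus):
        """Hash-based, repeatable choice: the same input always maps to the same
        substitute, so relationships survive across tables and masking runs."""
        return int(hashlib.sha256((secret + value).encode()).hexdigest(), 16) % modulus

    def mask_episode(patient_id, secret="demo-secret"):
        diagnosis, procedure = CONSISTENT_PAIRS[deterministic_pick(patient_id, secret, len(CONSISTENT_PAIRS))]
        surrogate = "P" + hashlib.sha256((secret + patient_id).encode()).hexdigest()[:8]
        return {"patient_ref": surrogate, "diagnosis": diagnosis, "procedure": procedure}

    # The same patient identifier produces the same masked record wherever it appears.
    assert mask_episode("NHS-123456") == mask_episode("NHS-123456")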
Synthetic data generation
The third way to create test data is to generate synthetic data. Synthetic test data generation makes the process somewhat easier because there is no requirement for data de-identification. Also, since no access to production data is needed, there is no impact on production performance. However, in order to generate representative synthetic data you do need a good understanding of the data relationships involved: not only those embedded within the database schema (or file system) but also those that are implicit within the data or not formally detailed within the schema. In other words, you need some sort of discovery process. Overall, while there is a definite requirement for this type of test data management approach, it represents a relatively small percentage of the whole market. Typically, the only organisations requiring this approach will implement synthetic data creation because they have no other alternatives. They tend to fall into one of three categories: completely new environments where there is no production data to use; extremely secure environments where so much of the data is sensitive that it would be very costly to mask all of it; and highly available environments where performance is critical and where the impact on the production environment from sub-setting is deemed to be too high.

Additional testing challenges
Finally, another major challenge of test data creation and, indeed, of testing in general, is coverage. What you would really like to achieve is testing of every possible code path with every possible combination of data using a minimum of tests. Unfortunately, that is very far from reality. Cloning a database, for example, often supports no more than 30% coverage, and frequently much less. There are ways to improve this percentage and reduce duplicated testing, but what you really want is a truly representative dataset without duplication. When used with appropriate tools it should be possible to get as much as 100% functional coverage and 90% code coverage with an absolute minimum of tests. This is much more thorough than typical development environments (where 50% coverage is nearer the norm) and should result in the production of better code in less time and at less cost, because of the reduced number of tests that need to be run.
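One widely used way of approaching that goal is combinatorial (pairwise) selection of test data, where every pair of parameter values appears in at least one row. The greedy Python sketch below illustrates the idea on a small, hypothetical parameter set; it is not how any particular tool computes its datasets, and it is only practical here because the full cross product is tiny:

    from itertools import combinations, product

    def pairwise_suite(parameters):
        """Greedily pick rows from the full cross product until every pair of
        values, across every pair of parameters, has been covered at least once."""
        names = list(parameters)
        values = [parameters[n] for n in names]
        uncovered = {((i, vi), (j, vj))
                     for i, j in combinations(range(len(names)), 2)
                     for vi in values[i] for vj in values[j]}

        def pairs_in(row):
            return {((i, row[i]), (j, row[j])) for i, j in combinations(range(len(names)), 2)}

        suite, candidates = [], list(product(*values))
        while uncovered:
            best = max(candidates, key=lambda r: len(pairs_in(r) & uncovered))
            gained = pairs_in(best) & uncovered
            if not gained:
                break
            suite.append(dict(zip(names, best)))
            uncovered -= gained
        return suite

    # Hypothetical payment-testing parameters: roughly 9-10 rows cover all value pairs,
    # versus 18 rows for the full cross product.
    rows = pairwise_suite({"card_type": ["visa", "mastercard", "amex"],
                           "currency":  ["GBP", "EUR", "USD"],
                           "channel":   ["web", "branch"]})
    print(len(rows))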

The best test data management solutions will address the following key questions:
1. How representative is the test data?
2. How agile is the test data?
3. How easy and comprehensive is the data masking?

Best practices for test data management


In this section we will consider how a best-in-class solution meets the
requirements of the three key questions we have just identified.
How representative is the data?
As we have discussed, you won't get a representative dataset if you simply subset the data randomly, which is an approach used by some TDM vendors. To start with, you need to discover the business relationships and business entities that exist within the data. These business entities should be understood and leveraged by the Test Data Management
solution to accelerate the sub-setting of the data and to ensure that the
test data is most closely aligned to, and representative of, real data. Part
of this discovery process should also include finding and understanding
sensitive information so it can be properly protected.
It is important to stress just how vital this understanding of business
entities is. A business entity, a conceptual view of which is illustrated
in Figure 1, provides a complete picture of that entity: for example, a
customer record including delivery addresses, order number, payment
history, invoices and service history. Note that such an entity may span
multiple data sources: even if your new application will only run against
one of these you will still need a broader understanding if you are to generate representative data. Further, it is important to be aware that business entities cannot be understood by simply turning to the data model
because many of the relationships that help to define a business entity
are actually encoded in the application layer rather than in any database
schema, and such relationships need to be inferred by tools such as IBM
InfoSphere Discovery.

Figure 1: Conceptual view of a business entity
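To illustrate what extracting by business entity means in practice, here is a minimal Python sketch (in-memory lists standing in for tables, hypothetical column names): starting from a chosen set of customers, it follows the relationships outward so that the extract is a complete, referentially consistent slice rather than a random sample:

    def extract_business_entities(customer_ids, customers, orders, invoices):
        """Pull every row, across related tables, that belongs to the selected
        customers, so the subset contains whole business entities."""
        wanted = set(customer_ids)
        customer_rows = [c for c in customers if c["customer_no"] in wanted]
        order_rows = [o for o in orders if o["customer_no"] in wanted]
        order_nos = {o["order_no"] for o in order_rows}
        invoice_rows = [i for i in invoices if i["order_no"] in order_nos]
        return {"customers": customer_rows, "orders": order_rows, "invoices": invoice_rows}

    # Example with toy data: selecting customer 1001 brings its orders and invoices with it.
    customers = [{"customer_no": "1001", "name": "Alice Example"},
                 {"customer_no": "1002", "name": "Bob Example"}]
    orders = [{"order_no": "O-1", "customer_no": "1001"},
              {"order_no": "O-2", "customer_no": "1002"}]
    invoices = [{"invoice_no": "I-1", "order_no": "O-1"}]
    print(extract_business_entities(["1001"], customers, orders, invoices))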



Providing agile test data
Much of today's development takes place in an agile environment. Here
you will repeatedly need new sets of test data. A TDM solution should
enable testers and developers to refresh test data to ensure they are
working with a clean dataset. In other words, developers and testers
should not have to go back to the database administrator and ask for
new test data. The ability to refresh data improves operational efficiency
while providing more time to test, thus enabling releases to be delivered
more quickly. In practical terms, we do not believe that agile development is realistic without the ability to refresh test data quickly. Agile
development means agile test data.
A TDM solution should also support the generation of differently sized
non-production databases. As we noted at the outset, TDM has practical
implications for quality assurance, training, demonstrations or other
non-production purposes. You might want a larger dataset for integration testing than you do for unit testing or for training or demonstration
purposes. In other words, you should be able to generate right-sized
databases according to your needs.
Data masking
For data masking you will need a range of different masking techniques
that run from simple shuffling and randomisation through to more sophisticated capabilities that preserve relationships and support relevant
business rules (such as being a valid post code). The goal is to ensure
that masked data is contextually correct so that testing processes will
run accurately. Proper data masking requires sensitive data discovery
and an understanding of relationships within and across databases.
From an implementation perspective, such a solution should provide the
ability to propagate keys across tables to ensure referential integrity is
maintained. In addition to these capabilities, it will be useful if the data
masking solution (such as InfoSphere Optim Data Masking) supports
major ERP and CRM environments as well as custom developments.
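A minimal sketch, in Python with hypothetical table and column names, of what propagating masked keys looks like: the parent table's keys are masked first, and exactly the same mapping is then applied to every child table so that joins still resolve after masking. Commercial masking tools do this automatically across real schemas; this is only an illustration of the idea:

    import hashlib

    def masked_key(value, secret="demo-secret"):
        """Deterministic surrogate: the same customer number always masks to the same token."""
        return "C" + hashlib.sha256((secret + value).encode()).hexdigest()[:10]

    customers = [{"customer_no": "1001", "name": "Alice Example"}]
    orders = [{"order_no": "O-1", "customer_no": "1001"},
              {"order_no": "O-2", "customer_no": "1001"}]

    # Build the mapping from the parent table, then apply it to parent and children alike
    # so that referential integrity survives the masking run.
    key_map = {c["customer_no"]: masked_key(c["customer_no"]) for c in customers}
    for c in customers:
        c["customer_no"] = key_map[c["customer_no"]]
    for o in orders:
        o["customer_no"] = key_map[o["customer_no"]]

    assert all(o["customer_no"] == customers[0]["customer_no"] for o in orders)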



Integration
TDM solutions should integrate seamlessly into software development models. For example, in Figure 2 we show how IBM has integrated test
data management into the quality testing process. The figure illustrates
a waterfall environment; for agile development, of course, this process
would be iterative.
As shown in Figure 2, IBM has built integration between Rational and
InfoSphere Optim. A notable feature is that the InfoSphere Optim Test Data Management solution offers a comparison capability so that you can inspect your data to see how it has been changed during testing.
IBM Greenhat complements InfoSphere Optim Test Data Management
by providing service virtualisation. From a TDM perspective this is
important because it caters to out-of-scope data or data that may be
outside your organisation's control, such as web-based data or partner
data. Service virtualisation allows you to capture this data for testing
purposes.

Figure 2: Test flow showing integration between Rational and InfoSphere Optim components


IBM InfoSphere Optim Test Data Management solutions


IBM's InfoSphere Optim Test Data Management solution delivers all of the best practices outlined
above. It provides capabilities to: discover and
understand test data, subset data from production, mask test data, refresh test data and
analyse test data results. It actually consists of
a number of discrete but integrated products
that span data masking, data discovery and
data sub-setting that together make up the
IBM InfoSphere Optim Test Data Management
solution. As mentioned, this integrates with
the IBM Rational testing portfolio and you can
also use IBM Greenhat in conjunction with this
environment. Notable features include a range
of data masking options, the most advanced
discovery capabilities of any product currently
in the market, a data comparison capability,
data sub-setting based on an understanding
of data relationships (to ensure that test data
is representative) and support for multi-sized
data subsets.
Use cases
The following represent two examples of companies using the IBM InfoSphere Optim Test
Data Management solution:
Allianz Seguros is a subsidiary of the Allianz Group, which provides its more than 60 million clients worldwide with a comprehensive range of services in the areas of property and casualty insurance, life and health
insurance, asset management and banking.
The company relies on several mission-critical
mainframe insurance applications, developed
in-house, to manage operations in all areas
of its business activities. By implementing InfoSphere Optim Test Data Management solution, Allianz Seguros was able to reduce the
scope of its testing databases to 10 percent
of the production environment, which provided time and cost savings and also meant
that the integrity of the data was preserved.
Allianz also needed to protect confidential
client information to comply with the Spanish
Law of Protection of Personal Data (LOPD).
Prior to the implementation of the InfoSphere Optim Test Data Management solution, the
DBA team had to write special programs to
mask client names, tax ID numbers and national identifiers and then move data into the
development and testing environments. Further, preserving the integrity of the test data
had previously presented another challenge
because the data structures that supported the insurance applications included dozens of
complex relationships. Although the previous
in-house sub-setting program offered some of
the needed functionality, to ensure valid test
results the development team needed a TDM
solution that would accurately preserve the
referential integrity of the data for even the
most complex data relationships. The InfoSphere Optim Test Data Management solution satisfied all requirements.
Cetelem is a subsidiary of the BNP Paribas
Group. The company's core mainframe applications are based on a production database
with over 600 IBM DB2 tables consuming
more than 1 terabyte of capacity and, prior to
the implementation of the InfoSphere Optim Test Data Management solution, the qualification
and user acceptance testing environments
together consumed more than 24 gigabytes
of capacity. Moreover, cloning of the database
for testing purposes was a lengthy process,
especially for processing iterative testing scenarios and refreshing the test environment.
As a result, it became more time consuming
and costly to create and manage multiple
test environments. It was also more difficult
to validate the reliability of new functionality
and to complete testing processes in time to
deliver that functionality to business users,
partners and customers. Now, using the InfoSphere Optim Test Data Management solution,
developers and quality assurance testers have
capabilities to browse and edit the DB2 test
data and to force error conditions for targeted
test scenarios. These capabilities improve
the validity of the testing processes and support goals for delivering reliable applications.
The InfoSphere Optim Test Data Management solution allows testers and developers to extract
specific test data by selecting a collection or
sampling of family entities that are already
defined. Extract requests are stored in queues
and executed online without impacting concurrent processing in Cetelem's production
environment. When it is necessary to extract
larger volumes of data for testing purposes,
batch processing mode is used. Creating
and refreshing test databases takes much
less time because the extract specifications
can be shared and reused. The InfoSphere Optim Test Data Management solution recognises
relationships defined in the database and also
provides the capability to include relationships
that are defined in the application as a part of
test data generation.


Conclusion
According to NIST (Planning Report 02-3: The Economic Impacts of Inadequate Infrastructure for Software Testing, 2002), the average testing team spends between 30% and 50% of its time setting up test environments rather than on actual testing, and an estimated 74% of projects suffer significant delays or quality issues. Further, according to Capers Jones (Software Quality in 2011: A Survey of the State of the Art, http://www.sqgne.org/presentations/2010-11/Jones-Nov-2010.pdf), poor software quality costs more than $150 billion per year in the US and over $500 billion worldwide.
These figures very strongly suggest the need for better testing and software quality. Automated test data management represents a major step
in that direction. TDM improves the quality and accuracy of testing and
supports agile development where errors are caught early and fixed more quickly, resulting in fewer defects for the project as a whole. Building
test data management into your development process will lower your
risk of late delivery penalties and improve customer satisfaction. By implementing intelligent sub-setting and data masking, organisations can
reduce storage and software costs. In our view, a formalised test data
management discipline provides a key strategic advantage compared to
traditional ad hoc methods.
Further Information
Further information about this subject is available from
http://www.BloorResearch.com/update/2138


Bloor Research overview


Bloor Research is one of Europe's leading IT
research, analysis and consultancy organisations. We explain how to bring greater Agility to corporate IT systems through the effective governance, management and leverage
of Information. We have built a reputation for
telling the right story with independent, intelligent, well-articulated communications
content and publications on all aspects of the
ICT industry. We believe the objective of telling
the right story is to:
• Describe the technology in the context of its business value and the other systems and processes it interacts with.
• Understand how new and innovative technologies fit in with existing ICT investments.
• Look at the whole market and explain all the solutions available and how they can be more effectively evaluated.
• Filter noise and make it easier to find the additional information or news that supports both investment and implementation.
• Ensure all our content is available through the most appropriate channel.
Founded in 1989, we have spent over two decades distributing research and analysis to IT
user and vendor organisations throughout
the world via online subscriptions, tailored
research services, events and consultancy
projects. We are committed to turning our
knowledge into business value for you.

About the author


Philip Howard
Research Director - Data Management
Philip started in the computer industry way back
in 1973 and has variously worked as a systems
analyst, programmer and salesperson, as well
as in marketing and product management, for
a variety of companies including GEC Marconi,
GPT, Philips Data Systems, Raytheon and NCR.
After a quarter of a century of not being his own boss, Philip set up his
own company in 1992 and his first client was Bloor Research (then
ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time
and he is now Research Director focused on Data Management.
Data management refers to the management, movement, governance
and storage of data and involves diverse technologies that include (but
are not limited to) databases and data warehousing, data integration
(including ETL, data migration and data federation), data quality, master
data management, metadata management and log and event management. Philip also tracks spreadsheet management and complex event
processing.
In addition to the numerous reports he has written on behalf of Bloor Research, Philip contributes regularly to IT-Director.com and IT-Analysis.
com and was previously editor of both Application Development News
and Operating System News on behalf of Cambridge Market Intelligence
(CMI). He has also contributed to various magazines and written a number
of reports published by companies such as CMI and The Financial Times.
Philip speaks regularly at conferences and other events throughout Europe
and North America.
Away from work, Philip's primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master), dining out and walking
Benji the dog.

Copyright & disclaimer


This document is copyright © 2012 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior
consent of Bloor Research.
Due to the nature of this material, numerous hardware and software
products have been mentioned by name. In the majority, if not all, of the
cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research's intent to
claim these names or trademarks as our own. Likewise, company logos,
graphics or screen shots have been reproduced with the consent of the
owner and are subject to that owner's copyright.
Whilst every care has been taken in the preparation of this document
to ensure that the information is correct, the publishers cannot accept
responsibility for any errors or omissions.

2nd Floor,
145-157 St John Street
LONDON,
EC1V 4PY, United Kingdom
Tel: +44 (0)207 043 9750
Fax: +44 (0)207 043 9748
Web: www.BloorResearch.com
email: info@BloorResearch.com
