
• Identify your data's current state and determine data quality issues to develop
  standards
• Identify the reusability of the existing data
• Manage possible risks early when integrating data with other applications
• Resolve missing or erroneous values
• Discover formats and patterns in your data
• Identify cleansing issues to maintain the integrity of the data
• Reveal hidden business rules
• Identify appropriate data values and define transformations to maintain data
  validity
• Report on column minimums, maximums, mean, median, mode, variance,
  covariance, standard deviation, and outliers
• Measure business rule compliance across data sets
• Report results in various formats, including PDF, HTML, XML, and CSV
• Provide point-in-time data profiling history
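
As a rough illustration of the column statistics in the list above, the following Python sketch summarizes one numeric column. The `summarize` function, the sample `order_totals` values, and the two-standard-deviation outlier cutoff are assumptions made for this example, not part of any particular profiling tool.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
    outliers = [v for v in values if stdev and abs(v - mean) > 2 * stdev]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "stdev": stdev,
        "outliers": outliers,
    }

# Hypothetical column values; 95 is deliberately out of line with the rest.
order_totals = [10, 12, 12, 11, 13, 12, 95]
report = summarize(order_totals)
print(report)
```

The outlier cutoff is a judgment call; a real assessment would pick a rule that fits the data's distribution.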



  
   
Allocating sufficient time and resources to conduct a thorough data profiling assessment will
help architects design a better solution and reduce project risk by quickly identifying and
addressing potential data issues.

By Matt Austin

Data profiling is a critical input task to any database initiative that incorporates source data from external systems.
Whether it is a completely new database build or simply an enhancement to an existing system, data profiling is a key
analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling
assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing
potential data issues.

How should you approach a new data profiling engagement and what can you expect in terms of value-added results?

Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The
first step -- and also a critical dependency -- is to clearly identify the appropriate person to provide the source data and
also serve as the "go to" resource for follow-up questions. Once you receive source data extracts, you're ready to
prepare the data for profiling. As a tip, loading data extracts into a database structure will allow you to freely write SQL
to query the data while also having the flexibility to use a profiling tool if needed.
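
One possible way to follow this tip is sketched below, using Python's built-in `sqlite3` and `csv` modules to load an extract into a staging table. The `load_extract` helper, the `customer_src` table, and the sample rows are hypothetical; every column is loaded as TEXT so that nothing is lost or coerced before profiling.

```python
import csv
import io
import sqlite3

def load_extract(conn, table, csv_file):
    """Load a CSV extract into a new staging table with all-TEXT columns."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Identifiers come from the file header; only use this with trusted extracts.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)

# Hypothetical extract; in practice this would be a file opened with newline="".
extract = io.StringIO("customer_id,name,state\n1,Ann,OH\n2,Bob,\n3,Cy,OH\n")
conn = sqlite3.connect(":memory:")
loaded = load_extract(conn, "customer_src", extract)

# Once loaded, ad hoc SQL is available for exploration.
state_counts = conn.execute(
    "SELECT state, COUNT(*) FROM customer_src GROUP BY state ORDER BY state"
).fetchall()
print(loaded, state_counts)
```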

When creating or updating a data profile, start with basic column-level analysis such as:
• Analyzing the number of distinct values within each column will help identify possible
  unique keys within the source data (which I'll refer to as natural keys). Identification of natural keys is a fundamental
  requirement for database and ETL architecture, especially when processing inserts and updates. In some cases, this
  information is obvious based on the source column name or through discussion with source data owners. However,
  when you do not have this luxury, distinct percent analysis is a simple yet critical tool to identify natural keys.
• Analyzing each column for missing or unknown data helps you identify potential
  data issues. This information will help database and ETL architects set up appropriate default values or allow NULLs
  on the target database columns where an unknown or untouched (i.e., NULL) data element is an acceptable
  business case. This analysis may also spawn exception or maintenance reports for data stewards to address as part of
  day-to-day system maintenance.
• Analyzing string lengths of the source data is a valuable step in
  selecting the most appropriate data types and sizes in the target database. This is especially true in large and highly
  accessed tables where performance is a top consideration. Reducing the column widths to be just large enough to
  meet current and future requirements will improve query performance by minimizing table scan time. If the
  respective field is part of an index, keeping the data types in check will also minimize index size, overhead, and scan
  times.
• Gathering information on minimum and maximum numerical and date values
  helps database architects identify appropriate data types to balance storage and performance requirements.
  If your profile shows that a numerical field does not require decimal precision, consider using an integer data type
  because of its relatively small size. Range analysis can also easily surface problems with converting Oracle dates to SQL
  Server. Until SQL Server 2008, the earliest possible datetime value was 1/1/1753, which often caused issues in
  conversions from Oracle systems.
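
The column-level checks above can be sketched in a few lines of SQL run from Python. This is a minimal illustration under stated assumptions: the `profile_column` helper, the `orders_src` table, and its rows are made up for the example, dates are assumed to be stored as ISO `YYYY-MM-DD` text, and SQLite stands in for whatever database holds your extracts.

```python
import sqlite3

def profile_column(conn, table, column):
    """Return distinct %, missing %, string length range, and value range for one column."""
    # Table and column names are interpolated directly; use trusted identifiers only.
    sql = f'''
        SELECT COUNT(*),
               COUNT(DISTINCT "{column}"),
               SUM(CASE WHEN "{column}" IS NULL OR TRIM("{column}") = ''
                        THEN 1 ELSE 0 END),
               MIN(LENGTH("{column}")),
               MAX(LENGTH("{column}")),
               MIN("{column}"),
               MAX("{column}")
        FROM "{table}"
    '''
    total, distinct, missing, min_len, max_len, min_val, max_val = conn.execute(sql).fetchone()
    if total == 0:
        return None  # nothing to profile in an empty table
    return {
        "distinct_pct": 100.0 * distinct / total,
        "missing_pct": 100.0 * missing / total,
        "min_len": min_len,
        "max_len": max_len,
        "min": min_val,
        "max": max_val,
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_src (order_id INTEGER, state TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders_src VALUES (?, ?, ?)",
    [(1, "OH", "2009-03-01"), (2, None, "1700-06-15"),
     (3, "OH", "2009-03-01"), (4, "", "2010-01-05")],
)

profiles = {c: profile_column(conn, "orders_src", c)
            for c in ("order_id", "state", "order_date")}

# Fully distinct, fully populated columns are natural key candidates (not proof of a key).
key_candidates = [c for c, p in profiles.items()
                  if p["distinct_pct"] == 100.0 and p["missing_pct"] == 0.0]

# Dates earlier than 1/1/1753 would not fit a SQL Server datetime column.
pre_1753 = conn.execute(
    "SELECT COUNT(*) FROM orders_src WHERE order_date < '1753-01-01'"
).fetchone()[0]
print(profiles, key_candidates, pre_1753)
```

A column flagged here is only a candidate; confirming a natural key still calls for a conversation with the source data owners.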
