
Extracting Data to Staging Area

Data is first extracted from the source system and placed in a staging area. This staging area is typically formatted like the source system. Keeping data in the same format as the source makes the first extract simple and avoids bogging down the source system. You will most likely want to process only changed data, to avoid the overhead of reprocessing the entire data set. This can be done by extracting data based on date/time information on the source system, by mining change logs, or by examining the data itself to determine what changed.

Tip 1: Make sure the source system date/time information is consistently available. Use data profiling to validate.
Tip 2: Store a copy of the prior version of the data in the staging area so that it can be compared to the current version to determine what changed.
Tip 3: Calculate checksums for both the current and prior versions, then compare checksums rather than multiple columns. This speeds up processing (see the sketch below).
Tip 4: Add a source system prefix to table names in the staging area. This helps keep data logically segregated.
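To illustrate Tip 3, here is a minimal sketch of checksum-based change detection, assuming extracted rows are available as Python dictionaries keyed by a natural key (the column names, the choice of MD5, and the sample data are all illustrative):

```python
import hashlib

def row_checksum(row, columns):
    """Concatenate the tracked column values and hash them once, so
    change detection compares one checksum instead of many columns."""
    payload = "|".join(str(row.get(col, "")) for col in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def changed_keys(current_rows, prior_rows, key_col, tracked_cols):
    """Return the natural keys whose checksums differ between the
    current extract and the prior version kept in the staging area."""
    prior = {r[key_col]: row_checksum(r, tracked_cols) for r in prior_rows}
    changed = []
    for row in current_rows:
        # New rows, and rows whose checksum moved, both need processing.
        if prior.get(row[key_col]) != row_checksum(row, tracked_cols):
            changed.append(row[key_col])
    return changed

# Illustrative usage with hypothetical customer staging data:
prior = [{"cust_id": 1, "name": "Tom Jones", "city": "Austin"}]
current = [{"cust_id": 1, "name": "Thomas Jones", "city": "Austin"},
           {"cust_id": 2, "name": "Ann Lee", "city": "Dallas"}]
print(changed_keys(current, prior, "cust_id", ["name", "city"]))  # [1, 2]
```

Comparing one checksum per row replaces a column-by-column comparison, which is what makes this approach fast when many columns are tracked.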

Applying Data Transformations


Data is now ready for transformation, which includes cleansing, rationalization, and enrichment. The cleansing process, sometimes called "scrubbing," removes errors, while rationalization removes duplicates and standardizes data. The enrichment process adds data.

Tools have been developed to scrub and standardize party information such as SSNs, names, addresses, telephone numbers, and email addresses. This software can also remove or merge duplicate information ("de-duping"). Available techniques include:
- Audit
- Correct at Source
- Specialized Software (Address Correction Software)
- Substituting Codes and Values
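As a simplified illustration of de-duping, the sketch below buckets party records by normalized name and email. Real matching and address-correction software uses far more sophisticated fuzzy and probabilistic matching; the field names and normalization rules here are assumptions:

```python
from collections import defaultdict

def normalize_party(name, email):
    """Crude normalization: case-fold, drop punctuation, and collapse
    whitespace so 'T. Jones ' and 't jones' land in the same bucket."""
    clean = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return (" ".join(clean.split()), email.strip().lower())

def find_duplicates(parties):
    """Group party records that normalize to the same (name, email) key;
    each returned group is a candidate set for merging."""
    buckets = defaultdict(list)
    for party in parties:
        buckets[normalize_party(party["name"], party["email"])].append(party)
    return [group for group in buckets.values() if len(group) > 1]
```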

Missing, Incomplete and Wrongly Formatted Data


Common problems that may require correction are missing data, incomplete data, and wrongly formatted data. In the case of missing data, an entire column value such as zip code or first name is empty. A tool could fill in the zip code based on a lookup of the address lines, city, and state. Incomplete data is partially missing, as in the case where an address contains the name of a street without the building number. Tools are available that can correct some of these problems. Finally, data may be in the wrong format. We may want telephone numbers to contain hyphens; a tool could consistently format telephone numbers (see the sketch below).
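A minimal sketch of the telephone formatting example, assuming ten-digit North American numbers and a target format of XXX-XXX-XXXX:

```python
import re

def format_phone(raw):
    """Strip everything but digits, then re-emit in the hyphenated
    XXX-XXX-XXXX layout; return None when the value is unrecoverable."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading country code
    if len(digits) != 10:
        return None                  # incomplete: route to manual review
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

print(format_phone("(512) 555 0147"))   # 512-555-0147
print(format_phone("555-0147"))         # None: incomplete data
```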

Applying Data Consistency Transformations


Consistent data is important for "apples to apples" comparisons. For example, all weight measures could be converted to grams, or all currency values to dollars. Transformations can also be used to make code values consistent, such as:
- Gender: ("M", "F") or ("y", "n")
- Boolean: ("Y", "N") or (1, 0)
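A sketch of these consistency transformations, with assumed conversion factors and assumed source encodings:

```python
# Assumed factors for converting source weight units to grams.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "lb": 453.59237, "oz": 28.349523125}

# Assumed source gender encodings mapped onto the conformed ("M", "F").
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def weight_in_grams(value, unit):
    """Convert any supported weight measure to grams for comparison."""
    return value * TO_GRAMS[unit.lower()]

def standard_gender(code):
    """Translate a source-specific gender code to the conformed value."""
    return GENDER_CODES[str(code).strip().lower()]

def standard_bool(value):
    """Conform Boolean encodings ("Y"/"N", 1/0, True/False) to True/False."""
    return str(value).strip().lower() in ("y", "yes", "1", "true")

print(weight_in_grams(2, "kg"))   # 2000.0
print(standard_bool("Y"))         # True
```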

More Data Cleansing Issues


- Duplicate Data: the same party appears with different names (T. Jones, Tom Jones, Thomas Jones)
- Dummy Data: placeholder values such as '111111111' for SSN
- Mismatched Data: postal code does not match city/state
- Inaccurate Data: incorrect inventory balances
- Overloaded Attributes: attributes mean different things in different contexts
- Meaning Embedded in Identifiers and Descriptions: such as including price in the SKU
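Several of these issues can be flagged automatically during cleansing. The sketch below checks for dummy SSNs and postal codes that do not match the state, using a hypothetical list of dummy values and a hypothetical ZIP-prefix reference table:

```python
DUMMY_SSNS = {"111111111", "000000000", "123456789"}   # assumed dummy values

# Hypothetical reference data: leading ZIP-3 prefixes mapped to states.
ZIP3_TO_STATE = {"787": "TX", "100": "NY"}

def audit_customer(row):
    """Return a list of data quality issues found on one customer row."""
    issues = []
    ssn = (row.get("ssn") or "").replace("-", "")
    if ssn in DUMMY_SSNS:
        issues.append("dummy SSN")
    expected_state = ZIP3_TO_STATE.get((row.get("zip") or "")[:3])
    if expected_state and expected_state != row.get("state"):
        issues.append("postal code does not match state")
    return issues

print(audit_customer({"ssn": "111-11-1111", "zip": "78701", "state": "NY"}))
# ['dummy SSN', 'postal code does not match state']
```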

Loading the Data Mart


Loading the data mart through efficient and effective methods is the subject of this article. When loading the data mart, dimensions are loaded first and facts second. Dimensions are loaded first so that the primary keys of the dimensions are known and can be added to the facts. Make sure that the following prerequisites are in place:
- Data is stored in the data warehouse and ready to load into the data mart
- Data maps have been created for movement from the data warehouse to the data mart
- Grain is determined for each dimension and fact

Loading Data Mart Dimensions


There are specific prerequisites that must be in place for dimensions:
- Dimensions have surrogate primary keys
- Dimensions have natural keys
- Dimensions have the needed descriptive, non-key attributes
- A maintenance strategy is determined for each dimension (see the Type 2 sketch below):
  - Slowly Changing Dimension (SCD) Type 1: overwrite
  - SCD Type 2: insert a new row, which partitions history
  - SCD Type 3: columns in the changed dimension contain prior data
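A minimal sketch of SCD Type 2 maintenance (the most involved of the three strategies), assuming each dimension row carries effective/expiration dates and a current-row flag; all column names are illustrative:

```python
from datetime import date

def apply_scd_type2(dimension_rows, incoming, natural_key, tracked_cols,
                    next_surrogate_key, as_of=None):
    """Expire the current row for a changed natural key and insert a new
    version, so history is partitioned by effective-date ranges.
    Only the 'changed row' path is shown; a brand-new natural key
    would simply be inserted."""
    as_of = as_of or date.today()
    current = next((r for r in dimension_rows
                    if r[natural_key] == incoming[natural_key]
                    and r["current_flag"]), None)
    if current and any(current[c] != incoming[c] for c in tracked_cols):
        current["expiration_date"] = as_of    # close out the prior version
        current["current_flag"] = False
        new_row = dict(incoming,
                       surrogate_key=next_surrogate_key,
                       effective_date=as_of,
                       expiration_date=None,
                       current_flag=True)
        dimension_rows.append(new_row)
    return dimension_rows
```

Under Type 1 the tracked columns would simply be overwritten in place; under Type 3 the prior values would first be copied into dedicated "previous value" columns.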

Some dimensions are loaded one time at the beginning of the data mart project, such as:
- Calendar Date
- Calendar Month
- US State
- US Zip Code

Dimension Name: Date_Dim
Description: Dates of the year
Grain: A single day
Primary Key: Date_Key (generated integer)
Natural Key: YYYY_MM_DD_Date
Descriptive Attributes: Multiple date formats are stored, plus week, month, quarter, year, and holidays. Both numeric dates and spelled-out dates are included.
Maintenance Strategy: The date dimension is loaded once, at the beginning of the data mart project. It may require updates to correct problems or to change attributes such as company_holiday_ind.
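A sketch of the one-time Date_Dim load described above, with a generated integer surrogate key and the YYYY_MM_DD natural key; the holiday indicator is assumed to be maintained by later updates, as the maintenance strategy notes:

```python
from datetime import date, timedelta

def build_date_dim(start, end):
    """Yield one Date_Dim row per day at the single-day grain, with a
    generated integer key and a YYYY_MM_DD natural key."""
    current, key = start, 1
    while current <= end:
        yield {
            "date_key": key,                         # surrogate primary key
            "yyyy_mm_dd_date": current.isoformat(),  # natural key
            "day_name": current.strftime("%A"),
            "month_nbr": current.month,
            "month_name": current.strftime("%B"),
            "quarter_nbr": (current.month - 1) // 3 + 1,
            "year_nbr": current.year,
            "company_holiday_ind": False,  # assumed: set by a later update
        }
        current += timedelta(days=1)
        key += 1

rows = list(build_date_dim(date(2024, 1, 1), date(2024, 12, 31)))
print(len(rows))  # 366 rows for the 2024 leap year
```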

Loading Data Mart Facts


Data mart facts consist of three types of columns: the primary key, dimensional keys, and measurements. In the data warehouse, there will be natural keys that can be joined with dimensions to obtain dimensional keys. For example:

Data Warehouse
  Primary key: purchase_order_nbr, line_item_nbr, effective_date
  Alternate identifiers: effective_date, product_code, facility_number
  Measurements: order_qty, received_qty, unit_price_amt

Data Mart
  Primary key: purchase_order_fact_id
  Dimensional keys: effective_date_id, product_id, facility_id
  Measurements: order_qty, received_qty, unit_price_amt
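A sketch of the dimensional key lookup performed during the fact load, assuming the dimensions were loaded first so that each natural key can be swapped for its surrogate key. The lookup-dictionary representation is an assumption, and purchase_order_fact_id is assumed to be generated by the database at insert time:

```python
def load_purchase_order_fact(warehouse_row, date_dim, product_dim, facility_dim):
    """Resolve natural keys to dimension surrogate keys, then keep only
    the dimensional keys and measurements in the fact row.
    Each *_dim argument is assumed to be a dict: natural key -> surrogate key."""
    return {
        "effective_date_id": date_dim[warehouse_row["effective_date"]],
        "product_id": product_dim[warehouse_row["product_code"]],
        "facility_id": facility_dim[warehouse_row["facility_number"]],
        "order_qty": warehouse_row["order_qty"],
        "received_qty": warehouse_row["received_qty"],
        "unit_price_amt": warehouse_row["unit_price_amt"],
    }
```

A missing dimension entry raises a KeyError here, which is deliberate: it surfaces facts that arrive before their dimension rows.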

Data Mapping for Data Warehousing and Business Intelligence


A Data Map is a specification that identifies data sources and targets as well as the mapping between them. The Data Map specification is created and reviewed with input from business Subject Matter Experts (SMEs) who understand the data.

There are two levels of mapping: entity level and attribute level. Each target entity (table) will have a high-level mapping description supported by a detailed attribute-level mapping specification.

Target Table Name: dw_customer
Target Table Description: High-level information about a customer, such as name, customer type, and customer status.
Source Table Names: dwprod1.dwstage.crm_cust, dwprod1.dwstage.ord_cust
Join Rules: crm_cust.custid = ord_cust.cust_nbr
Filter Criteria: crm_cust.cust_type not = 7
Additional Logic: N/A

Then, for each attribute, the attribute-level data map specifies:
- Source: table name, column name, datatype
- Target: table name, column name, datatype
- Transformation Rule
- Notes

Transformations may include:
- Aggregate
- Substring
- Concatenate
- Breakout Array Values / Buckets
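One way to record an attribute-level mapping is as structured data that an ETL process can read. This sketch shows a single hypothetical entry for the dw_customer target above; every value in it is illustrative:

```python
# A hypothetical attribute-level map entry for the dw_customer target.
attribute_map = {
    "source": {"table": "dwprod1.dwstage.crm_cust",
               "column": "cust_nm", "datatype": "varchar(80)"},
    "target": {"table": "dw_customer",
               "column": "customer_name", "datatype": "varchar(80)"},
    "transformation_rule": "Substring",  # one of the transformations above
    "notes": "Hypothetical: take characters 1-80 after trimming whitespace.",
}
```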
