Data is first extracted from the source system and placed in a staging area. This staging area is typically formatted like the source system. Keeping data in the same format as the source makes the first extract simple and avoids bogging the source system down. You will most likely want to process only changed data, to avoid the overhead of reprocessing the entire data set. This can be done by extracting data based on date/time information on the source system, by mining change logs, or by examining the data to determine what changed.
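The date/time approach can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the table src_customer, its columns and the in-memory SQLite database are all hypothetical stand-ins for a real source system, and it assumes the source keeps a reliable last-update timestamp on every row.

```python
import sqlite3

def extract_changed_rows(conn, last_extract_ts):
    """Return source rows modified after the previous extract's high-water mark."""
    cur = conn.execute(
        "SELECT cust_id, cust_name, last_update_ts "
        "FROM src_customer WHERE last_update_ts > ? ORDER BY last_update_ts",
        (last_extract_ts,),
    )
    return cur.fetchall()

# Hypothetical source table, created here only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE src_customer (cust_id INTEGER, cust_name TEXT, last_update_ts TEXT)"
)
conn.executemany(
    "INSERT INTO src_customer VALUES (?, ?, ?)",
    [
        (1, "Acme", "2024-01-01 08:00:00"),
        (2, "Globex", "2024-01-03 09:30:00"),
        (3, "Initech", "2024-01-05 17:45:00"),
    ],
)

# Timestamps are stored as ISO-style text, so string comparison orders them correctly.
previous_high_water_mark = "2024-01-02 00:00:00"
changed = extract_changed_rows(conn, previous_high_water_mark)

# The new high-water mark is the latest timestamp seen in this extract.
new_high_water_mark = (
    max(row[2] for row in changed) if changed else previous_high_water_mark
)
```

Saving the new high-water mark after each successful run lets the next extract pick up where this one stopped.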
Tip 1: Make sure the source system date/time information is consistently available. Use data profiling to validate.
Tip 2: Store a copy of the prior version of data in the staging area so that it can be compared to the current version to determine what changed.
Tip 3: Calculate checksums for both current and prior versions, then compare checksums rather than multiple columns. This speeds up processing.
Tip 4: Add a source system prefix to table names in the staging area. This helps to keep data logically segregated.
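Tips 2 and 3 can be illustrated together with a short sketch. It assumes each row carries a stable key and that SHA-256 is an acceptable checksum; the record layout and the function names row_checksum and detect_changes are hypothetical.

```python
import hashlib

def row_checksum(row, columns):
    """Hash the compared columns into one value.

    A separator character keeps ("ab", "c") distinct from ("a", "bc").
    None is treated as an empty string in this sketch.
    """
    joined = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(prior_rows, current_rows, key, columns):
    """Return keys that were inserted, updated, or deleted since the prior version."""
    prior_sums = {r[key]: row_checksum(r, columns) for r in prior_rows}
    inserted, updated = [], []
    for r in current_rows:
        if r[key] not in prior_sums:
            inserted.append(r[key])
        elif prior_sums[r[key]] != row_checksum(r, columns):
            updated.append(r[key])
    current_keys = {r[key] for r in current_rows}
    deleted = [k for k in prior_sums if k not in current_keys]
    return inserted, updated, deleted

# Prior and current staging copies of a hypothetical customer extract.
prior = [
    {"cust_id": 1, "name": "Acme", "status": "A"},
    {"cust_id": 2, "name": "Globex", "status": "A"},
    {"cust_id": 3, "name": "Initech", "status": "A"},
]
current = [
    {"cust_id": 1, "name": "Acme", "status": "A"},
    {"cust_id": 2, "name": "Globex", "status": "I"},
    {"cust_id": 4, "name": "Umbrella", "status": "A"},
]

inserted, updated, deleted = detect_changes(
    prior, current, key="cust_id", columns=["name", "status"]
)
```

In practice the prior checksums would usually be stored alongside the staged rows, so only the current version needs to be hashed on each run.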
Tools have been developed to scrub and standardize party information such as SSNs, names, addresses, telephone numbers and email addresses. This software can also remove or merge duplicate information ("de-duping"). Techniques available include:
Audit
Correct At Source
Specialized Software (Address Correction Software)
Substituting Codes and Values
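As a rough illustration of standardizing, substituting codes and de-duping, the sketch below cleans a few party records by hand. Commercial cleansing tools do far more. The STATE_CODES lookup, the field names and the rule of matching duplicates on email address are all assumptions made for this example.

```python
import re

# Hypothetical substitution map: free-form source values -> standard codes.
STATE_CODES = {
    "calif.": "CA", "california": "CA", "ca": "CA",
    "n.y.": "NY", "new york": "NY", "ny": "NY",
}

def standardize(party):
    """Return a cleaned copy of one party record."""
    digits = re.sub(r"\D", "", party["phone"])  # strip everything but digits
    state_key = party["state"].strip().lower()
    return {
        # Collapse runs of whitespace, then apply simple title casing.
        "name": " ".join(party["name"].split()).title(),
        "email": party["email"].strip().lower(),
        # Keep the last 10 digits; assumes US-style phone numbers.
        "phone": digits[-10:],
        "state": STATE_CODES.get(state_key, party["state"].strip().upper()),
    }

def dedupe(parties):
    """Keep the first record seen for each standardized email address."""
    seen, unique = set(), []
    for p in map(standardize, parties):
        if p["email"] not in seen:
            seen.add(p["email"])
            unique.append(p)
    return unique

raw = [
    {"name": "  jane   DOE ", "email": "Jane.Doe@Example.com ",
     "phone": "(555) 123-4567", "state": "Calif."},
    {"name": "Jane Doe", "email": "jane.doe@example.com",
     "phone": "555.123.4567", "state": "CA"},
    {"name": "john smith", "email": "jsmith@example.com",
     "phone": "+1 555 987 6543", "state": "New York"},
]

clean = dedupe(raw)
```

Standardizing before comparing is what lets the first two records be recognized as the same party even though their raw values differ.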
Inaccurate Data
Overloaded Attributes
Some dimensions are loaded once at the beginning of the data mart project, such as:
Calendar Date
Calendar Month
US State
US Zip Code
Dimension Name: Date_Dim
Description: Dates of the year
Grain: A single day
Primary Key: Date_Key (generated integer)
Natural Key: YYYY_MM_DD_Date
Descriptive Attributes: Multiple date formats are stored, plus week, month, quarter, year and holidays. Both numeric dates and spelled out dates are included.
Maintenance Strategy: The date dimension is loaded once, at the beginning of the data mart project. It may require updates to correct problems or to change attributes such as company_holding_ind.
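A one-time load of a date dimension like the one described above could be generated along these lines. This is a sketch only. The column names, the ISO format chosen for the natural key, the use of ISO week numbers and the holiday_ind flag are assumptions, not the layout of the original Date_Dim.

```python
from datetime import date, timedelta

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

def build_date_dim(start, end, holidays=frozenset()):
    """Generate one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    date_key = 1  # generated integer surrogate key
    while day <= end:
        rows.append({
            "date_key": date_key,
            "yyyy_mm_dd_date": day.isoformat(),  # natural key, assumed ISO format
            "spelled_out_date": f"{MONTH_NAMES[day.month - 1]} {day.day}, {day.year}",
            "week_of_year": day.isocalendar()[1],  # ISO week number
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "holiday_ind": "Y" if day in holidays else "N",  # hypothetical flag
        })
        day += timedelta(days=1)
        date_key += 1
    return rows

# Build one year of rows with a single hypothetical holiday.
date_dim = build_date_dim(
    date(2024, 1, 1), date(2024, 12, 31), holidays={date(2024, 1, 1)}
)
```

Because the rows are derived entirely from the calendar, the load can be rerun for a corrected range or holiday list when an attribute needs fixing.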
Alternate identifiers
measurements
Data Warehouse purchase_order_nbr line_item_nbr effective_date Effective_date product_code facility_number order_qty received_qty unit_price_amt
There are two levels of mapping: entity level and attribute level. Each target entity (table) has a high-level mapping description, supported by a detailed attribute-level mapping specification.
dw_customer
High-level information about a customer such as name, customer type and customer status.
dwprod1.dwstage.crm_cust
dwprod1.dwstage.ord_cust
crm_cust.custid = ord_cust.cust.cust_nbr
crm_cust.cust_type not = 7
N/A
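The entity-level mapping for dw_customer can be read as a join plus a filter. The sketch below shows one plausible interpretation using an in-memory SQLite database. It assumes the join is crm_cust.custid = ord_cust.cust_nbr, and the column lists and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging and target tables; only custid, cust_nbr and cust_type
# are named in the mapping, the remaining columns are assumptions.
conn.executescript("""
CREATE TABLE crm_cust (custid INTEGER, cust_name TEXT, cust_type INTEGER);
CREATE TABLE ord_cust (cust_nbr INTEGER, cust_status TEXT);
CREATE TABLE dw_customer (
    cust_id INTEGER, cust_name TEXT, cust_type INTEGER, cust_status TEXT
);
INSERT INTO crm_cust VALUES (1, 'Acme', 2), (2, 'Globex', 7), (3, 'Initech', 4);
INSERT INTO ord_cust VALUES (1, 'ACTIVE'), (2, 'ACTIVE'), (3, 'CLOSED');
""")

# Join the two staged sources and exclude customer type 7, as the mapping states.
conn.execute("""
INSERT INTO dw_customer
SELECT c.custid, c.cust_name, c.cust_type, o.cust_status
FROM crm_cust AS c
JOIN ord_cust AS o ON c.custid = o.cust_nbr
WHERE c.cust_type <> 7
""")

loaded = conn.execute(
    "SELECT cust_id, cust_name, cust_status FROM dw_customer ORDER BY cust_id"
).fetchall()
```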
Then, for each attribute, the attribute-level data map specifies:
Source: table name, column name, datatype
Target: table name, column name, datatype
Transformation Rule
Notes
Transformations may include:
Aggregate
Substring
Concatenate
Breakout Array Values / Buckets
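Several of these transformation rules are simple enough to sketch directly. The helper functions, the ATTRIBUTE_MAP structure and the sample column names below are hypothetical. They show one way an attribute-level map could drive the transformation, not a standard format.

```python
def substring(value, start, length):
    """Return length characters of value beginning at a zero-based start."""
    return value[start:start + length]

def concatenate(*parts, sep=" "):
    """Join the non-empty parts with a separator."""
    return sep.join(p for p in parts if p)

def bucket(value, bounds, labels):
    """Return the label of the first bucket whose upper bound exceeds value."""
    for upper, label in zip(bounds, labels):
        if value < upper:
            return label
    return labels[-1]  # value is at or above the highest bound

def aggregate(rows, key, measure):
    """Sum a measure for each distinct key value."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[measure]
    return totals

# Hypothetical attribute-level map: target column, source columns, rule.
ATTRIBUTE_MAP = [
    {"target": "cust_full_name", "sources": ["first_name", "last_name"],
     "rule": lambda first, last: concatenate(first, last)},
    {"target": "area_code", "sources": ["phone"],
     "rule": lambda phone: substring(phone, 0, 3)},
    {"target": "order_size_band", "sources": ["order_total"],
     "rule": lambda total: bucket(total, [100, 1000], ["small", "medium", "large"])},
]

def apply_map(row, mapping):
    """Build one target row by applying each rule to its source columns."""
    return {m["target"]: m["rule"](*(row[s] for s in m["sources"])) for m in mapping}

source_row = {"first_name": "Jane", "last_name": "Doe",
              "phone": "5551234567", "order_total": 250}
target_row = apply_map(source_row, ATTRIBUTE_MAP)

order_lines = [
    {"product_code": "P1", "order_qty": 5},
    {"product_code": "P1", "order_qty": 3},
    {"product_code": "P2", "order_qty": 7},
]
qty_by_product = aggregate(order_lines, "product_code", "order_qty")
```

Keeping the rules in a data structure rather than scattered through code makes the mapping specification and the implementation easier to compare side by side.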