ETL Methodology Document
4/17/2012
Overview
This document is designed to help business associates and technical resources understand the process of building a data warehouse and the methodology employed to build the EDW. This methodology has been designed to provide the following benefits:
1. A high level of performance
2. Scalability to any size
3. Ease of maintenance
4. Boiler-plate development
5. Standard documentation techniques
ETL Definitions
Term  Definition
ETL (Extract, Transform, Load): The physical process of extracting data from a source system, transforming the data to the desired state, and loading it into a database.
EDW (Enterprise Data Warehouse): The logical data warehouse designed for enterprise information storage and reporting.
DM (Data Mart): A small subset of a data warehouse specifically defined for a subject area.
Documentation Specifications
A primary driver of the entire process is accurate business information requirements. TDD Consulting will use standard documents prepared by the Project Management Institute for requirements gathering, project signoff, and compiling all testing information.
Tables
All destination tables will utilize the following naming convention: EDW_<SUBJECT>_<TYPE>
There are six types of tables used in a data warehouse: Fact, Dimension, Aggregate, Staging, Temp, and Audit. Sample names are listed below the quick overview of table types.

Fact: a table type that contains atomic data.
Dimension: a table type that contains referential data needed by the fact tables.
Aggregate: a table type used to aggregate data, forming a pre-computed answer to a business question (ex. totals by day).
Staging: tables used to store data during ETL processing; the data is not removed immediately.
Temp: tables used during ETL processing that can be truncated immediately afterward (ex. storing order IDs for lookup).
Audit: tables used to keep track of the ETL process (ex. processing times by job).

Each type of table will be kept in a separate schema. This will decrease maintenance work and time spent looking for a specific table.

Table Name         Explanation
EDW_RX_FACT        Fact table containing RX subject matter
EDW_TIME_DIM       Dimension table containing TIME subject matter
EDW_CUSTOMER_AG    Aggregate table containing CUSTOMER subject matter
ETL_PROCESS_AUDIT  Audit table containing PROCESS data
STG_DI_CUSTOMER    Staging table sourced from the DI system, used for CUSTOMER data processing
ETL_ADDRESS_TEMP   Temp table used for ADDRESS processing
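The EDW_<SUBJECT>_<TYPE> convention can be sketched as a small helper. This is an illustrative example only; the suffix map below is inferred from the sample names above and is not defined by the methodology itself.

```python
# Hypothetical helper illustrating the EDW_<SUBJECT>_<TYPE> naming convention.
# The suffix map is an assumption inferred from the sample table names.
TYPE_SUFFIX = {
    "fact": "FACT",
    "dimension": "DIM",
    "aggregate": "AG",
}

def edw_table_name(subject: str, table_type: str) -> str:
    """Build a destination table name such as EDW_RX_FACT."""
    suffix = TYPE_SUFFIX[table_type.lower()]
    return f"EDW_{subject.upper()}_{suffix}"

print(edw_table_name("rx", "fact"))         # EDW_RX_FACT
print(edw_table_name("time", "dimension"))  # EDW_TIME_DIM
```

Keeping name construction in one place like this makes the convention enforceable rather than advisory.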
ETL Processing
The following types of ETL jobs will be used for processing. The list below gives each job type, its function, and its naming convention.

Extract: extracts information from a source system and places it in a staging table.
  Naming convention: Extract<Source><Subject> (ex. ExtractDICustomer)
Source: sources information from STG tables and performs column validation.
  Naming convention: Source<Table> (ex. SourceSTGDICustomer)
LoadTemp: loads temp tables used in processing.
  Naming convention: LoadTemp<Table> (ex. LoadTempETLAddressTemp)
LookupDimension: looks up dimension tables.
  Naming convention: LookupDimension<Subject> (ex. LookupDimensionCustomer)
Transform: transforms the subject area data and generates insert files.
  Naming convention: Transform<Subject> (ex. TransformCustomer)
QualityCheck: checks the quality of the data before it is loaded into the EDW.
  Naming convention: QualityCheck<Subject> (ex. QualityCheckCustomer)
Load: loads the data into the EDW.
  Naming convention: Load<Table> (ex. LoadEDWCustomerFact)
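The job naming conventions above compose into a predictable jobstream for each subject area. The following sketch is an assumption about how the names might be generated programmatically; the document itself only defines the naming patterns.

```python
# A minimal sketch (not from the methodology itself) of composing the job
# names for one subject area in processing order, per the conventions above.
def jobstream_for(source: str, subject: str, table: str) -> list[str]:
    """Return the ETL job names, in order, for one subject area."""
    return [
        f"Extract{source}{subject}",
        f"SourceSTG{source}{subject}",
        f"Transform{subject}",
        f"QualityCheck{subject}",
        f"Load{table}",
    ]

print(jobstream_for("DI", "Customer", "EDWCustomerFact"))
# ['ExtractDICustomer', 'SourceSTGDICustomer', 'TransformCustomer',
#  'QualityCheckCustomer', 'LoadEDWCustomerFact']
```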
Comments
Every job will have a standard comment template that specifically spells out the following attributes of the job:

Job Name: LoadEDWCustomerFact
Purpose: Load the EDW_Customer_Fact table
Predecessor: QualityCheckCustomer
Date: April 21, 2006
Author: Wes Dumey
Revision History:
  April 21, 2006  Created the job from standard template
  April 22, 2006  Added error checking for table insert

In addition, there will also be a job data dictionary that describes every job in a table such that it can be easily searched via standard SQL.
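The searchable job data dictionary might look like the following sketch. The table name ETL_JOB_DICTIONARY and its columns are assumptions for illustration; the document does not define them.

```python
# Sketch of a job data dictionary searchable "via standard SQL".
# Table and column names here are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ETL_JOB_DICTIONARY (
        JOB_NAME    TEXT PRIMARY KEY,
        PURPOSE     TEXT,
        PREDECESSOR TEXT,
        AUTHOR      TEXT
    )""")
conn.execute(
    "INSERT INTO ETL_JOB_DICTIONARY VALUES (?, ?, ?, ?)",
    ("LoadEDWCustomerFact", "Load the EDW_Customer_Fact table",
     "QualityCheckCustomer", "Wes Dumey"),
)

# Find every job that runs after the Customer quality check.
rows = conn.execute(
    "SELECT JOB_NAME FROM ETL_JOB_DICTIONARY WHERE PREDECESSOR = ?",
    ("QualityCheckCustomer",),
).fetchall()
print(rows)  # [('LoadEDWCustomerFact',)]
```

Because the comment template fields map one-to-one onto dictionary columns, the template can be parsed and loaded automatically.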
Column        Data Type  Explanation
ISSUE_CODE    NUMBER     Code uniquely identifying problems with data if STATUS_CODE = R
BATCH_NUMBER  NUMBER     Batch number used to process the data (auditing)
Data columns to follow
Auditing
The ETL methodology maintains a process for providing audit and logging capabilities. For each run of the process, a unique batch number composed of the time segments is created. This batch number is loaded with the data into the PSA and all target tables. In addition, an entry with the following data elements will be made into the ETL_PROCESS_AUDIT table.

Column                Data Type  Explanation
DATE                  DATE       (Index) run date
BATCH_NUMBER          NUMBER     Batch number of process
PROCESS_NAME          VARCHAR    Name of process that was executed
PROCESS_RUN_TIME      TIMESTAMP  Time (HH:MI:SS) of process execution
PROCESS_STATUS        CHAR       S = SUCCESS, F = FAILURE
ISSUE_CODE            NUMBER     Code of issue related to process failure (if F)
RECORD_PROCESS_COUNT  NUMBER     Row count of records processed during run
The audit process will allow for efficient logging of process execution and encountered errors.
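One way to form a batch number "composed of the time segments" is to concatenate the run timestamp's fields into a single number. This is an assumption; the document does not specify the exact segments or their order.

```python
# Sketch of a batch number built from time segments (assumed format:
# YYYYMMDDHH24MISS concatenated into one numeric value).
from datetime import datetime

def batch_number(now: datetime) -> int:
    """Concatenate the run timestamp's segments into a numeric batch id."""
    return int(now.strftime("%Y%m%d%H%M%S"))

print(batch_number(datetime(2006, 4, 21, 13, 5, 9)))  # 20060421130509
```

A time-derived batch number sorts chronologically, which makes it convenient both as a join key to the audit table and for purging old PSA data by range.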
Quality
Due to the sensitive nature of data within the EDW, data quality is a driving priority. Quality will be handled through the following processes:
1. Source: the source job will contain a quick data scrubbing mechanism that verifies the data conforms to the expected type (numeric data is a number and character data is a letter).
2. Transform: the transform job will contain matching metadata of the target table and will verify that NULL values are not loaded into NOT NULL columns and that the data is transformed correctly.
3. QualityCheck: a separate job is created to do a cursory check on a few identified columns and verify that the correct data is loaded into these columns.
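The source job's type check from step 1 can be sketched as follows. The function and type labels are illustrative assumptions, not part of the methodology's defined interfaces.

```python
# Assumed sketch of the source job's type-conformance check: verify that a
# raw value matches its expected NUMBER or CHARACTER type before staging.
def conforms(value: str, expected: str) -> bool:
    """Return True if the raw value conforms to the expected type."""
    if expected == "NUMBER":
        # Allow an optional leading minus sign and one decimal point.
        return value.strip().lstrip("-").replace(".", "", 1).isdigit()
    if expected == "CHARACTER":
        return value.isalpha()
    raise ValueError(f"unknown expected type: {expected}")

print(conforms("1234", "NUMBER"))      # True
print(conforms("12A4", "NUMBER"))      # False
print(conforms("SMITH", "CHARACTER"))  # True
```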
Source Quality
A data scrubbing mechanism will be constructed. This mechanism will check identified columns for any anomalies (ex. embedded carriage returns) and value domains. If an error is discovered, the data is fixed and a record is written to the ETL_QUALITY_ISSUES table (see below for the table definition).
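A scrub for embedded carriage returns, the anomaly named above, might look like this. The function name and fix-up strategy (replace with spaces) are assumptions; the methodology only requires that the data be fixed and the issue logged.

```python
# Assumed sketch of one scrubbing rule: detect and fix embedded carriage
# returns or line feeds in a column value, reporting whether one was found
# so the caller can log a row to ETL_QUALITY_ISSUES.
def scrub_carriage_returns(value: str) -> tuple[str, bool]:
    """Return the cleaned value and whether an anomaly was detected."""
    had_anomaly = "\r" in value or "\n" in value
    cleaned = value.replace("\r", " ").replace("\n", " ").strip()
    return cleaned, had_anomaly

cleaned, found = scrub_carriage_returns("123 Main St\r\nApt 4")
print(found)  # True
```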
Transform Quality
The transformation job will employ a matching metadata technique. If the target table enforces NOT NULL constraints, a check will be built into the job to prevent NULLs from being loaded and causing a jobstream abend.
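The matching-metadata check can be sketched as below. The metadata dictionary and column names are illustrative assumptions; in practice the nullability flags would be read from the target table's catalog.

```python
# Assumed sketch of the Transform job's NOT NULL check: compare each row
# against target-table metadata before generating insert files.
TARGET_METADATA = {  # column -> nullable? (illustrative, not from the doc)
    "CUSTOMER_ID": False,
    "CUSTOMER_NAME": False,
    "MIDDLE_INITIAL": True,
}

def not_null_violations(row: dict) -> list[str]:
    """Return the NOT NULL columns this row would violate."""
    return [col for col, nullable in TARGET_METADATA.items()
            if not nullable and row.get(col) is None]

print(not_null_violations({"CUSTOMER_ID": 1, "CUSTOMER_NAME": None}))
# ['CUSTOMER_NAME']
```

Rejecting the row here, rather than letting the database raise the constraint error, is what prevents the jobstream abend.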
Quality Check
Quality check is the last point of validation within the jobstream. QC can be configured to check any percentage of rows (0-100%) and any number of columns (1-X). QC is designed to pay attention to the most valuable or vulnerable rows within the data sets. QC will use a modified version of the data scrubbing engine used during the source job to derive correct values and reference rules listed in the ETL_QC_DRIVER table. Any suspect rows will be pulled from the insert/update files and updated in the PSA table to an R status, and an issue code will be created for the failure.

Logging of Data Failures

Data that fails the QC job will not be loaded into the EDW, based on defined rules. An entry will be made into the following table (ETL_QUALITY_ISSUES). An indicator will show the value of the column as defined in the rules (H = HIGH, L = LOW). This indicator will allow resources to be used efficiently to trace errors.

ETL_QUALITY_ISSUES

Column        Data Type  Explanation
DATE          DATE       Date of entry
BATCH_NUMBER  NUMBER     Batch number of process creating entry
PROCESS_NAME  VARCHAR    Name of process creating entry
COLUMN_NAME   VARCHAR    Name of column failing validation
ETL_QUALITY_ISSUES (continued)

Column          Data Type  Explanation
COLUMN_VALUE    VARCHAR    Value of column failing validation
EXPECTED_VALUE  VARCHAR    Expected value of column failing validation
ISSUE_CODE      NUMBER     Issue code assigned to error
SEVERITY        CHAR       H = HIGH, L = LOW

ETL_QUALITY_AUDIT

Column                Data Type  Explanation
DATE                  DATE       Date of entry
BATCH_NUMBER          NUMBER     Batch number of process creating entry
PROCESS_NAME          VARCHAR    Name of process creating entry
RECORD_PROCESS_COUNT  NUMBER     Number of records processed
RECORD_COUNT_CHECKED  NUMBER     Number of records checked
PERCENTAGE_CHECKED    NUMBER     Percentage of checked records out of data set
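QC's configurable sampling (any percentage of rows, 0-100%) can be sketched as follows. The rule source and seeding are assumptions; the document only states that QC references rules in the ETL_QC_DRIVER table.

```python
# Assumed sketch of QC row sampling: select the configured percentage of the
# data set for checking, the counts feeding RECORD_COUNT_CHECKED and
# PERCENTAGE_CHECKED in ETL_QUALITY_AUDIT.
import random

def qc_sample(rows: list[dict], percent: int, seed: int = 0) -> list[dict]:
    """Pick the configured percentage of rows (0-100%) to check."""
    count = round(len(rows) * percent / 100)
    return random.Random(seed).sample(rows, count)

rows = [{"CUSTOMER_ID": i} for i in range(100)]
checked = qc_sample(rows, 25)
print(len(checked))  # 25
```

A fixed seed makes a QC run reproducible for debugging; production runs would typically use an unseeded generator or a weighting scheme that favors the most valuable or vulnerable rows.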
Lookup Dimension
Transform
Quality Check
Load
Closing
After reading this ETL document you should have a better understanding of the issues associated with ETL processing. This methodology has been created to address as many of those issues as possible while providing a high level of performance, ease of maintenance, and scalability, and while remaining workable in a real-time ETL processing scenario.