
ETL Framework Methodology

Prepared by Michael Favero & Kenneth Bland

1 The ETL Architecture


It seems that many companies approach their data warehouse unlike any other internal software application development they have done. Twenty-five years of software application development and lifecycle methodology go out the window. An ETL application is the same as any other application when it comes to development: it undergoes the same evolutionary steps as any other application development. The goals of the ETL application development process should be as follows:

- Modular, repeatable, and reusable code segments
- A boilerplate or cookie-cutter approach to processes
- Self-documenting process flows
- Fully metadata-aware and cooperating processes

1.1 ETL functionality

This document contains a boilerplate approach to developing a data warehouse ETL process and the steps necessary to implement it. The starting point for the ETL process is to define its functionality. The ETL process will move data from one or many source systems and populate a data warehouse with that data. The data will undergo several levels of transformation as it is conformed to standards in the warehouse and transformed from the source system data model into the data warehouse data model. The ETL process will contain audit checks, restart and fail-over logic, and volumes of process metadata reporting. The entire process will be metadata aware, meaning that definitions of tables and files are documented internally throughout the entire process, and that all process metadata and impact analysis metadata is shared.

The ETL will populate the data warehouse by the order of subject areas. Because subject areas have foreign key relationships, it makes sense to start with the lowest-level subject area first. The ETL process will have a natural hierarchy to its processing dictated by these foreign key constraints. The dimensional data model has an advantage over a 3NF model in that the subject areas have no foreign key interdependencies among the dimensions, nor among the facts. This benefits the process when determining parallel processing capability. The 3NF model suffers from the same old song of transforming and loading through the "outer tables in" approach.

The ETL process will also have the responsibility of being parameter driven so that it is configurable at runtime. This will play into setting up an application that responds to the readiness of certain source systems for processing while others may be unavailable. The process will further be constructed of reusable code modules, operating under the rules of a black box. This further enhances the robustness of the system and ease of maintenance. On that note, the process will be targeted towards being maintainable, rather than wired for speed out of the gate. The data warehouse environment is one of change, and the processes must be ready to adapt to the needs of an ever-changing-its-mind organization.

1.2 Data staging and process segmentation


The single most valuable lesson to be learned in designing ETL processes is to keep things simple. The tendency to overcomplicate ETL processes is abundant because of the wealth of talent in software development: developers will often design something overly complex and rely on sheer talent to bail themselves out of a coding mess. ETL is extremely simple: extract, transform, then load the data. Sounds easy, right? This microscopic view holds true when a developer is looking at a single source table, a single target table, and minor transformations and joins. The designer naturally builds a single process to extract, transform, and load the data. This process can be depicted as follows:

1.3 The Simple ETL picture:

As the number of processes increases, because more sources and targets exist, the designer has to take a more architectural tack and design a system. Many processes may have a common join to a temporary work table, and therefore those processes have a predecessor requirement that this temporary work table be constructed prior to the execution of the dependent processes. As the number of work tables increases, and interdependencies between the source tables become more pronounced, an architecture has to rise to the forefront that elegantly coordinates these efforts. No better example exists than that of the foreign key relationships in the target Data Warehouse (DW). The order of precedence in processing dictates that an architecture has to coordinate the transformation from the source systems into the target DW. The best ETL solution will also inherently provide staging points for the data as it moves through the ETL cycle. Data staging is the act of recording the data in a permanent state, which is ideally a sequential file. For example, after extracting the data from the source system, the table is written to a sequential file. After transformation, another sequential file is written.

1.4 Data staging has several benefits:


- An archive-ready snapshot of the source data is recorded for subsequent review.
- Development is assisted in debugging a process by providing stepping points in transformation.
- Production support is aided by allowing archive snapshots to be offloaded to a test environment for problem event recreation.
- Reloads are facilitated by having a ready-to-load file set available.
- A checkpoint is placed into the process, facilitating a restart from this checkpoint.

The act of staging data also puts checkpoints into the process, because the process segments along these staging files. In essence, the developer can create more modular processes that have specific functionality:

1. Process Source: extract from a source system table into a sequential file.
2. Process Transform: transform the collected source data into a target table sequential file.
3. Process Load: load the target table sequential files into the target database.

As more complex derivations develop, these process segments will no longer be the only ones necessary to transform the source data. Intermediate work tables can greatly simplify the transformation logic. In addition, advanced techniques can be introduced that allow creation of recovery images of rows targeted for update. Also, other techniques can be added to each processing segment to enhance the benefit of this approach. Ultimately, a well-designed ETL process can have the following eight-step process:

1.5 The Well-Designed Eight-Step ETL Process


1.5.1 Source Extract:

Select or extract data from a source system table (or tables) into a sequential file. Keep the select as simple as possible; complex joins should be avoided by using sandbox table joins when appropriate. Deciding when to join data at the source and when to sandbox tables requires looking at the entire ETL system as a whole. Some rules of thumb:

Use a merge operation in an ETL sandbox when:
o The joined table data will be used by multiple sources.
o The rules of the join are complex and could be more easily understood (documented) by using the ETL tool.
o The performance of the join is very poor.
o The source system does not allow the tables to be joined (sequential files or disparate DBs).

Use source system JOINs when:
o The joined table data is only connected to a single source AND none of the above apply.
o There are very few columns needed from the joined table.
o Very few rows will be selected from a very large joined table.
o The join will limit the number of source rows selected from the PRIMARY/DRIVING table and thereby reduce the number of rows processed through the ETL.

Cleanse the source data and conform it to the warehouse standards. TRANSFORMATION of the data is NOT done during this step. For example:
o Eliminate junk data, cleanse invalid columns, and audit incoming data.
o Simple TRIM of spaces.
o Differentiate the current data snapshot against the previous snapshot for rudimentary changed data detection.
o The resulting data sets should be in sequential files (a sketch of this step follows below).
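The following is a minimal sketch of a sourcing job under these rules, written in Python with an in-memory SQLite database standing in for a hypothetical ABC source system and a pipe-delimited sequential file as the staging format; the table, column, and file names are illustrative, not part of the methodology. Only a simple select, light cleansing (TRIM, a representative value for null text), and the write to a sequential file happen here; no transformation.

import csv
import sqlite3

# Stand-in for the ABC source system; in practice this would be a connection
# to the real source database (names here are hypothetical).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE Parts (Part_ID INTEGER, Part_Name TEXT, Part_Status TEXT)")
src.executemany("INSERT INTO Parts VALUES (?, ?, ?)",
                [(1, "  Widget ", "A"), (2, "Gadget", None)])

# Keep the select simple: no complex joins at the source.
rows = src.execute("SELECT Part_ID, Part_Name, Part_Status FROM Parts")

# Cleanse and conform only -- no transformation at this step.
with open("Src_ABC_Parts.dat", "w", newline="") as out:
    writer = csv.writer(out, delimiter="|")
    for part_id, name, status in rows:
        name = (name or "").strip()               # simple TRIM of spaces
        status = (status or " ").strip() or " "   # representative value for null text
        writer.writerow([part_id, name, status])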

The next three steps are all concerned with building the sandbox version of the Data Warehouse. We try very hard to NEVER USE PERSISTENT SANDBOX tables. Instead, by inspecting the source keys coming in, we build a subset of the DW in the sandbox database. It will become more apparent as we proceed why we do this for restart and recovery reasons.

1.5.2 Lookup Load:

Create custom DW temporary tables that will facilitate transformation. These temporary DB tables are used to join to the large DW table and create subsets or sandbox extractions.
1.5.3 Lookup Extract:

Joining the above temporary tables to the large DW tables (or extracting the entire table in the case of small DW tables), create a sequential file before image of the DW rows to be updated; these also become the subsets or sandbox extractions.
1.5.4 Lookup Build:

Using the sequential before image, create a sandbox table copy of the DW table (stgXXXX) as well as a cross-reference lookup of source system keys to surrogate keys (skeyXXXX). These sandbox tables form the subset or sandbox of the Data Warehouse.
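Below is a minimal sketch of steps 1.5.2 through 1.5.4 in Python, using a single in-memory SQLite connection as a stand-in for both the DW and the sandbox; the table names (tmpPartyKeys, Party, stgParty, skeyParty) and columns are hypothetical illustrations of the temp-key, before-image, and sandbox-build technique described above.

import sqlite3

db = sqlite3.connect(":memory:")  # stands in for both the DW and the sandbox

# The DW table (illustrative subset of columns).
db.execute("""CREATE TABLE Party (
    Party_ID INTEGER PRIMARY KEY,
    Source_System_Code TEXT, Source_System_Key TEXT, Party_Name TEXT)""")
db.executemany("INSERT INTO Party VALUES (?,?,?,?)",
               [(1, "A", "1000", "Acme"), (2, "B", "1000~1", "Zenith")])

# 1.5.2 Lookup Load: capture the incoming source keys in a temporary table.
db.execute("CREATE TABLE tmpPartyKeys (Source_System_Code TEXT, Source_System_Key TEXT)")
db.executemany("INSERT INTO tmpPartyKeys VALUES (?,?)", [("A", "1000"), ("A", "2000")])

# 1.5.3 Lookup Extract: join the temp keys to the large DW table so only the
# pertinent rows come back (a before image, normally written to a sequential file).
before_image = db.execute("""
    SELECT p.* FROM Party p
    JOIN tmpPartyKeys k
      ON p.Source_System_Code = k.Source_System_Code
     AND p.Source_System_Key  = k.Source_System_Key""").fetchall()

# 1.5.4 Lookup Build: build the sandbox subset (stgParty) and the surrogate
# key cross-reference (skeyParty) from the before image.
db.execute("CREATE TABLE stgParty AS SELECT * FROM Party WHERE 0")
db.execute("""CREATE TABLE skeyParty (
    Source_System_Code TEXT, Source_System_Key TEXT, Party_ID INTEGER,
    PRIMARY KEY (Source_System_Code, Source_System_Key))""")
db.executemany("INSERT INTO stgParty VALUES (?,?,?,?)", before_image)
db.executemany("INSERT INTO skeyParty VALUES (?,?,?)",
               [(c, k, i) for (i, c, k, _name) in before_image])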
1.5.5 Transform:

Transform the collected source data (the sequential file from step 1) into updates/inserts to the sandbox table. Eliminate junk data, cleanse invalid columns, and audit the post-transformation data. (Note: cleansing occurs during both sourcing and transforming, because transformation may cause a resulting value to violate the definition of the target column contents, from both the database and business rule aspects.) Differentiate the final transformed data against the current warehouse data for rudimentary changed data detection. The resulting data will normally be stored in a table representing the sandbox. (Note: changed data detection can happen in two places: sourcing and transforming. A row may only be determined to have changed once transformation is complete, because the change may occur on a referenced (lookup) table row. This will be noted below as the most insidious problem in determining changed data. If one compares the data rows at the sourcing image point and throws out a row because it is the same as yesterday's source image, there is a risk of NOT determining the change later at the transforming point. At this later point the joining of the lookup data may show a CHANGE DUE TO REFERENCED DATA.)
1.5.6 Load Build:

Create load-ready before-image sequential files of target rows within the data warehouse to provide database recovery capability (one for inserts and one for updates).
1.5.7 Load Insert:

Load the target table INSERT sequential files into the target database.
1.5.8 Load Update:

Load the target table UPDATE sequential files into the target database. NOTE: steps 2, 3, 7, and 8 can be done with simple SQL scripts or bulk load utilities; a sketch of steps 7 and 8 follows.
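Here is a minimal sketch of the Load Insert and Load Update steps in Python, with an in-memory SQLite database standing in for the target warehouse; the table, columns, and load-file names (Load_Ins_EDW_Party.dat, Load_Upd_EDW_Party.dat) are hypothetical. In practice these steps would typically be a bulk loader or simple SQL scripts applying the load-ready files.

import csv
import sqlite3

dw = sqlite3.connect(":memory:")  # stand-in for the target warehouse
dw.execute("""CREATE TABLE Party (
    Party_ID INTEGER PRIMARY KEY, Party_Name TEXT, Batch_Nbr INTEGER)""")

def load_insert(path):
    """Step 1.5.7: apply the INSERT load-ready sequential file."""
    with open(path, newline="") as f:
        rows = [tuple(r) for r in csv.reader(f, delimiter="|")]
    dw.executemany("INSERT INTO Party (Party_ID, Party_Name, Batch_Nbr) VALUES (?,?,?)", rows)

def load_update(path):
    """Step 1.5.8: apply the UPDATE load-ready sequential file."""
    with open(path, newline="") as f:
        rows = [tuple(r) for r in csv.reader(f, delimiter="|")]
    dw.executemany("UPDATE Party SET Party_Name = ?, Batch_Nbr = ? WHERE Party_ID = ?",
                   [(name, batch, pid) for (pid, name, batch) in rows])

# Tiny sample load-ready files, purely for illustration.
with open("Load_Ins_EDW_Party.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows([(10, "Acme", 501)])
with open("Load_Upd_EDW_Party.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows([(10, "Acme Corp", 502)])

load_insert("Load_Ins_EDW_Party.dat")
load_update("Load_Upd_EDW_Party.dat")
dw.commit()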

1.6 Horizontally Banded ETL


Once the processes are being developed into the eight jobs or process steps, the execution order of these segments must be addressed. There are two basic ways: horizontal and vertical banding. Since the microscopic view of ETL would be to source, lookup, transform, and load all in one scripted process, you end up with a whole series of horizontally banded scripts. For example:

Master Controlling Script
  Source #1 > Lookup #1 > Transform #1 > Recover #1 > Load #1
  Source #2 > Lookup #2 > Transform #2 > Recover #2 > Load #2
  Source #3 > Lookup #3 > Transform #3 > Recover #3 > Load #3
  Source #4 > Lookup #4 > Transform #4 > Recover #4 > Load #4

1.7 Vertically Banded ETL


Now, when multiple transforms share references to the same lookups, the creation of the lookups puts a relationship on these processes. The lookups become a predecessor requirement to many transformation jobs. This can quickly scale out of a manageable design. For this and many other reasons, the simpler approach is to vertically band the segments. For example:

Master Controlling Script
  Source band:    Source #1, Source #2, Source #3, Source #4
  Lookup band:    Lookup #1, Lookup #2, Lookup #3, Lookup #4
  Transform band: Transform #1, Transform #2, Transform #3, Transform #4
  Recover band:   Recover #1, Recover #2, Recover #3, Recover #4
  Load band:      Load #1, Load #2, Load #3, Load #4
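The following is a minimal sketch of a vertically banded master controlling script in Python: each band runs its jobs in parallel, and a go/no-go decision point sits between bands. The job inventory and names are hypothetical, and run_job is a placeholder for invoking the actual ETL tool or script.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical job inventory: each band lists independent jobs that may run in parallel.
BANDS = [
    ("Source",    ["Src_ABC_Parts", "Src_ABC_Orders", "Src_XYZ_Customers"]),
    ("Lookup",    ["Lkp_Bld_EDW_Products", "Lkp_Bld_EDW_Party"]),
    ("Transform", ["Xfm_Dim_Parts_Products", "Xfm_Fact_Orders_OrderDetails"]),
    ("Recover",   ["Rcv_EDW_Products", "Rcv_EDW_OrderDetails"]),
    ("Load",      ["Load_Dim_EDW_Products", "Load_Fact_EDW_OrderDetails"]),
]

def run_job(name):
    # Placeholder: invoke the ETL tool / script for this job and return success or failure.
    print(f"running {name}")
    return True

def run_band(band, jobs):
    with ThreadPoolExecutor() as pool:      # parallel within the band
        results = list(pool.map(run_job, jobs))
    return all(results)

for band, jobs in BANDS:
    if not run_band(band, jobs):
        # Go/no-go decision point: shut down elegantly before the next band begins,
        # so the target database has not been modified unless the Load band has started.
        print(f"{band} band failed: stopping before the next band")
        break
else:
    print("warehouse refresh complete")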

This simple approach enables so much:

- Vertical banding breaks processing into non-related segments that have parallel processing capability. Sourcing processes can all run concurrently, within the limits of the hardware. This creates a more simultaneous snapshot of source data, providing a tighter clustering of the data than horizontal banding, which spreads sourcing out across a wider time range. Lookups are all parallel, with the few exceptions that require serial processing. Transform is serial where foreign key validation and/or surrogate key substitution is necessary, and parallel whenever possible. Recover processing is completely parallel. Lastly, the load segment is parallel where foreign key constraints allow.
- A go or no-go decision point exists at each banding segment. For example, suppose that after sourcing all of the data necessary for transformation, the exception count during data cleansing exceeded a predefined tolerance. The master controlling process can determine that the sourcing phase has failed and elegantly shut down processing. There are no long-running threads to worry about or wait for wrap-up. The target database is still up and running, without having had any data modification. This capability exists all the way up to the load phase.
- Production support reset/restart is greatly simplified, as the process lends itself to ease in resetting. The well-designed ETL process should be coded to clean up its work areas at master controlling script initialization time. Restarting the process from the beginning automatically means resetting the work area. Restarting from the last successfully completed segment will not reset the work environment.
- A test environment can be built from the before and after images in order to recreate problems or just capture good test data.
- Maximum process tuning can occur because each segment utilizes 100% of the resources needed for processing. For example, sourcing is dependent on how fast the source system can select the data, transfer it across a network, cleanse it, and write it to a sequential file, in that order. If a process was sourcing, referencing, transforming, and loading all in that same process, you cannot answer the question of which is the limiting factor: the sourcing, referencing, transformation, or loading. Separate modular processes provide maximum tuning capability.
- Transformation is still regulated by foreign key dependencies and therefore will be mostly serial processing.
- Loading can be delayed and does not need to occur during the warehouse update processing. This gives added flexibility in that sometimes a company's sourcing window differs from its load window. In addition, if the loading fails because of insufficient access to the target database, transaction log overrun, etc., the loading is easily restartable.

The most noticeable drawback to this approach is that some ETL processes are designed to require that a predecessor process conduct its inserts and updates. For example, if an ETL process has to load customer and order information, the customer load must precede the order load, correct? But doesn't this make it difficult to cancel a warehouse refresh process once you've started loading data? The correct solution is to fully transform all data first and be satisfied with the result set before loading. This requires that a careful sandbox design take place that will allow the ETL process to act on a work database. This is covered more during the construction discussion.

The second noticeable drawback is that it seems vertically banded processes may not be as efficient as a horizontally banded process. Horizontal banding introduces no waiting as simple processes move from sourcing through to load. Vertical banding requires that all processes within a banding segment complete before the next banding segment is begun. This means that the longest-running source, lookup, transform, or load process determines the runtime for that particular banding segment. The transformation processing segment also contains many serially dependent jobs, and therefore its runtime is the sum of the runtimes of these dependent component processes. However, as the number of shared lookups and foreign keys increases, the complexity of scripting the horizontal process, due to the interdependencies of the lookup/transformation jobs, increases much faster. All of the other benefits of vertical banding outweigh any speed benefits that horizontal banding MAY have. In fact, vertical banding is very likely to be FASTER than horizontal banding, because with horizontal banding ENTIRE threads of ETL processing (from sourcing to loading) become dependent upon the completion of other ENTIRE threads (from sourcing to loading). We design ETL architecture for maintenance first, and performance second. Once a maintenance-oriented architecture is in place, it is readily apparent which processes need tuning.

As mentioned previously, multiple systems integration will require that a work database be created. The target data model will be used as the work database model. The DW is recreated within the sandbox database. The sandbox tables are populated at run time with the contents of the DW necessary for ETL processing of the current source data set. This allows the ETL to work within the confines of the sandbox for optimum processing speed, as well as giving the ETL a place to insert and update rows, assign surrogate keys, and reference and foreign-key-assign these future target rows BEFORE they are actually loaded to the DW. At completion of the ETL process, the rows inserted and updated are simply extracted from the sandbox tables to load-ready sequential files and then loaded to the real target database.
As the ETL has to provide a cross-reference/translation between the source system keys and the surrogate keys assigned in the warehouse, two sandbox tables exist for every table in the DW. These sandbox tables are named skeyXXXX and stgXXXX and are described below:
- The skeyXXXX tables are the surrogate key translation tables. The primary key of these tables is the source system code and source system key. The only attribute is the surrogate key assigned in the DW.
- The stgXXXX tables are exact copies of the DW tables for smaller DW tables and sandbox subset images for larger DW tables. The method of reducing the size of these tables for the larger DW tables is described later as the volume-reduction-of-rows technique.

The ETL architecture provides for a mini-copy of the DW at run time. The ETL process is free to update and massage this mini-copy at will without impacting the DW. A successful ETL process will then feed the necessary updates for the DW. All rows necessary for processing have been staged within the sandbox database, and all source data is in sequential form. All processing will take place on the ETL server platform. Actual update of the target can be delayed, and in the event of any issue the process has maximum flexibility for restart.

2 Surrogate key assignment


ETL developers often hear about surrogate key assignment, but are rarely told when to use surrogate keys. First, what is a surrogate key? A surrogate key is a new integer key value assigned to a row of unique data. Why would a warehouse need surrogate keys? The simple answer is to use surrogate keys when multiple sources of data commingle into a single table. For example, a company has two databases with a customer table in each database. They wish to put all of their customers into a single table in their data warehouse. The dilemma is that the customers are unique to each source table, but the primary keys are repeated across both databases. This means that customer 1000 in database #1 is not the same customer in database #2. If you put these customers into the same table, using the originating source system key as the primary key, one of these customers becomes the insert and the other becomes the update. Modelers account for this by including in the warehouse schema an extra column in each table that identifies the source system of origination. This code identifies the source system, and therefore the duplicated source system key and the source system code make up the compound primary key. However, this tends to gum up the elegance of any data model.

This is where the surrogate key comes into play. The table is modeled with a totally new single integer column that is the primary key for the table. The source system key and source system code columns are now alternate keys for the table. During ETL processing, the tables are queried by their source system key and source system code to determine if the row exists in the warehouse. If the row exists, the assigned surrogate key is returned, allowing the ETL developer to know which row is to be updated. If the row doesn't exist, this source system key and source system code combination can be assigned the next available surrogate key. (A sketch of this lookup-or-assign step follows below.)

Another reason for surrogate key assignment is to facilitate more efficient joins within the warehouse database. Some transactional systems use varchar values as primary keys. These keys can be quite lengthy, and become very inefficient during join processing. A lot of legacy systems also have intelligence in their keys, and people know column for column what each digit or letter represents. By replacing these varchar keys with integer surrogate keys, a significant increase in join performance can occur. This also holds true if many columns make up a compound primary key, and it reduces the risk of Cartesian product joins in a warehouse.

The method for employing surrogate keys begins with identifying the source systems and assigning them source system codes. In the case of only one source system, you may dispense with the source system code, at the peril of having to rework the entire model should another source system become available. It is preferable to use two- to three-digit codes, but the idea is to use a meaningful value that everyone will recognize. The second step is to determine which source system table maps into which data warehouse table. This process should have been done during the analysis and the detailed scenario mapping. The columns that are the primary keys will now be referred to as the source system keys. The source system key is the means of identifying the row in the source system when looking at the data in the data warehouse. This value may be a concatenated string of values because of a compound key in the source system.
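Here is a minimal sketch of the lookup-or-assign step in Python, using an in-memory SQLite database as a stand-in for the skeyParty cross-reference sandbox table introduced earlier; the table and column names are hypothetical, and in the real architecture the starting maximum surrogate value would come from the warehouse itself during the lookup build.

import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the sandbox database
db.execute("""CREATE TABLE skeyParty (
    Source_System_Code TEXT, Source_System_Key TEXT, Party_ID INTEGER,
    PRIMARY KEY (Source_System_Code, Source_System_Key))""")
db.execute("INSERT INTO skeyParty VALUES ('A', '1000', 1)")

def surrogate_key(code, key):
    """Return the existing surrogate key, or assign the next available one."""
    row = db.execute("""SELECT Party_ID FROM skeyParty
                        WHERE Source_System_Code = ? AND Source_System_Key = ?""",
                     (code, key)).fetchone()
    if row:                         # row already in the warehouse: this is an update
        return row[0], "update"
    new_id = db.execute("SELECT COALESCE(MAX(Party_ID), 0) + 1 FROM skeyParty").fetchone()[0]
    db.execute("INSERT INTO skeyParty VALUES (?, ?, ?)", (code, key, new_id))
    return new_id, "insert"         # first-time occurrence: this is an insert

print(surrogate_key("A", "1000"))   # (1, 'update')  customer 1000 from system Alpha
print(surrogate_key("B", "1000"))   # (2, 'insert')  same natural key, different system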

Example: source systems Alpha and Beta are integrating their databases into the data warehouse. Both source systems have customer tables that have primary keys of sequentially assigned customer numbers. Each source system has independent customer history, and even though the same customer may exist in both systems, it is not currently possible to associate customers in one system with those in the other. The two source tables are mapped and transformed into the warehouse Party table:

Alpha Customer Table: Customer_ID <PK>, Name, Title, Create_Dt, Status, Category, Region, Territory, Salesrep, Address_1, Address_2, Address_3, City, State, Zip, Zip_4

Beta Customer Table: Cust_ID <PK>, Cust_Version <PK>, Start_Date, End_Date, Status_Code, Name, Address_1, Address_2, Address_3, City, State, Zip, Phone, Phone_Ext, Salesman, Region, Territory, Last_Order_Date

Warehouse Party Table: Party_ID <PK>, Source_System_Code, Source_System_Key, Party_Type, Party_Name, Party_Title, Party_Create_Dt, Party_Status, Employee_Nbr, Employee_Dept, Employee_Loc, Employee_Ext, Customer_Cat, Customer_Reg, Customer_Terr, Vendor_Rep, Vendor_Nbr, Original_Batch_Nbr, Batch_Nbr, Version_Nbr

During analysis, it was determined that the correct mapping for the two source systems is as follows:
Scenario 1: First time occurrence insert (driving table Alpha:Customer, target table Warehouse:Party)
  Party_ID            = ASSIGN(Party)
  Source_System_Code  = 'A'
  Source_System_Key   = Alpha:Customer.Customer_ID
  Party_Type          = 'C'
  Party_Name          = Alpha:Customer.Name
  Party_Title         = Alpha:Customer.Title
  Party_Create_Dt     = Alpha:Customer.Create_Dt
  Party_Status        = CONFORM(Party.Party_Status, Alpha:Customer.Status, Alpha:Customer.Status)
  Employee_Nbr        = NULL
  Employee_Dept       = NULL
  Employee_Loc        = NULL
  Employee_Ext        = NULL
  Customer_Cat        = CONFORM(Party.Customer_Cat, Alpha:Customer.Category, Alpha:Customer.Category)
  Customer_Reg        = CONFORM(Party.Customer_Reg, Alpha:Customer.Region, Alpha:Customer.Region)
  Customer_Terr       = CONFORM(Party.Customer_Terr, Alpha:Customer.Territory, Alpha:Customer.Territory)
  Vendor_Rep          = NULL
  Vendor_Nbr          = NULL
  Original_Batch_Nbr  = BATCHNUMBER
  Batch_Nbr           = BATCHNUMBER
  Version_Nbr         = 1

Scenario 2: Update of existing row (driving table Alpha:Customer, target table Warehouse:Party)
  Party_ID      = SELECT Party.Party_ID FROM Warehouse:Party WHERE Party.Source_System_Code = 'A' AND Party.Source_System_Key = Alpha:Customer.Customer_ID
  Party_Name    = Alpha:Customer.Name
  Party_Title   = Alpha:Customer.Title
  Party_Status  = CONFORM(Party.Party_Status, Alpha:Customer.Status, Alpha:Customer.Status)
  Customer_Cat  = CONFORM(Party.Customer_Cat, Alpha:Customer.Category, Alpha:Customer.Category)
  Customer_Reg  = CONFORM(Party.Customer_Reg, Alpha:Customer.Region, Alpha:Customer.Region)
  Customer_Terr = CONFORM(Party.Customer_Terr, Alpha:Customer.Territory, Alpha:Customer.Territory)
  Batch_Nbr     = BATCHNUMBER
  Version_Nbr   = CURRENT.Version_Nbr + 1

Scenario 3: First time occurrence insert (driving table Beta:Cust, target table Warehouse:Party)
  Party_ID            = ASSIGN(Party)
  Source_System_Code  = 'B'
  Source_System_Key   = Beta:Customer.Cust_ID + '~' + Cust_Version
  Party_Type          = 'C'
  Party_Name          = Beta:Cust.Name
  Party_Title         = NULL
  Party_Create_Dt     = Beta:Cust.Start_Date
  Party_Status        = CONFORM(Party.Party_Status, Beta:Cust.Status_Code, Beta:Cust.Status_Code)
  Employee_Nbr        = NULL
  Employee_Dept       = NULL
  Employee_Loc        = NULL
  Employee_Ext        = NULL
  Customer_Cat        = 'R'
  Customer_Reg        = CONFORM(Party.Customer_Reg, Beta:Cust.Region, Beta:Cust.Region)
  Customer_Terr       = CONFORM(Party.Customer_Terr, Beta:Cust.Territory, Beta:Cust.Territory)
  Vendor_Rep          = NULL
  Vendor_Nbr          = NULL
  Original_Batch_Nbr  = BATCHNUMBER
  Batch_Nbr           = BATCHNUMBER
  Version_Nbr         = 1

Scenario 4: Update of existing row (driving table Beta:Cust, target table Warehouse:Party)
  Party_ID      = SELECT Party.Party_ID FROM Warehouse:Party WHERE Party.Source_System_Code = 'B' AND Party.Source_System_Key = (Beta:Customer.Cust_ID + '~' + Cust_Version)
  Party_Name    = Beta:Cust.Name
  Party_Status  = CONFORM(Party.Party_Status, Beta:Cust.Status_Code, Beta:Cust.Status_Code)
  Customer_Reg  = CONFORM(Party.Customer_Reg, Beta:Cust.Region, Beta:Cust.Region)
  Customer_Terr = CONFORM(Party.Customer_Terr, Beta:Cust.Territory, Beta:Cust.Territory)
  Batch_Nbr     = BATCHNUMBER
  Version_Nbr   = CURRENT.Version_Nbr + 1

During processing of the Alpha source system, rows are assigned a source system code of 'A', while source system Beta is assigned a source system code of 'B'. The rows are identifiable back to their source system by knowing the construction behind the source system key. It's important to understand why all source system primary keys map to a single warehouse source system key column: as in the above example, the modeler cannot model a customer table using either source system's primary key scheme, as a third source system can throw another variation into the mix.

Surrogate keys are supposed to be meaningless values to the end user. Many ETL designers strive to build ETL solutions that do not introduce gaps in surrogate key assignments. The reasoning is that processes that allow gaps in the assignments are more difficult to audit for problems. Simply put, the row count should match the maximum surrogate key in use. It does not always work this way in the real world. There will be times when processing problems or database rejects will cause gaps in the assignment. This is true if the rejected rows are unable to be re-attempted for load. Sometimes the ETL work involved with insuring that surrogate keys stay contiguous is not worth the effort, and you will still have gaps.

The question is always asked: why can't the database assign the surrogate key? Many databases supply the ability to generate surrogate keys on a row's insertion. The problem with allowing the database to supply the surrogate key is that it involves actually updating the target database during transformation. For example, if there are two target tables in the warehouse, a customer and an order, then the proper segmentation of the process means that the customer data and order data are sourced together. Next, the lookup segment is completed, and the transformation can begin. The customer transformation occurs first, inserting into the target database. Once the customer transformation process is complete, the order transformation process begins. The order transformation process has to retrieve from the target customer table the surrogate key assigned for that customer in order to perform the surrogate key substitution when it places the surrogate key into the order row. If we let the target database assign the surrogate keys then, of course, the target table must actually be loaded. This exposes our ETL process to potential partial loading and requires the target DB to be available for a much longer window. In addition, lookups of these surrogate key values directly from the target tables during the order process will be extremely slow compared to using sandbox tables for the lookup.

Our solution of having the ETL process assign the surrogate keys during the transformation process into sandbox tables addresses these issues. The target DB remains in query-only mode until the final loading step of ALL target tables. The ENTIRE ETL process can be completed against the sandbox (see the later section titled "The staging database (Sandbox)") before the target warehouse need be locked for insert/update processing. Lookups from the tables comprising the sandbox are far more efficient due to their size and ETL-specific indexing. Thus, restart and recovery is much simpler.

3 Data lineage
Data lineage is the ability to trace the path a column value takes either up the derivation stream or down. Companies will use this capability to see from where a column in the target table is derived. Although the mapping scenarios supply this information, the scenarios may become out of date and may no longer be reliable. It is recommended that in all cases the mapping scenarios database be kept current, but it is recognized that the mapping scenarios will need to be supplemented with a process that can do the research based on the process design.

Any ETL tool chosen must have the capability to show the entire derivation path for a target column. This becomes difficult for many tools to accomplish outside of a single job, because not all transformation work will happen within a single job. Several jobs may accomplish the goal of transforming the data. In the case of the ETL process architecture shown in this document, a source column may go through several jobs on its way to the target. The data lineage capability must be able to connect the dots between jobs. Since the jobs are writing and reading sequential files, the implied derivation path can be determined by analyzing the sequential files and their usage in a job. A job that reads a sequential file can be implied as following the job that wrote that sequential file. By having a consistent naming standard that has a convention for these sequential files, the order of the jobs can be understood. Examining the job designs can show one job writing a sequential file and another reading the sequential file. This completes the circuit and the bi-directional analysis can continue. The same method holds true for relational tables. A simple linked-list style approach can quickly resolve the job dependencies. The ability to add rules for the lineage becomes a must-have in the ETL tool in order to bridge the gap when multiple jobs are part of the derivation process. Tracking data lineage from the target column back to the source cannot even begin unless the columns described in the next section are added to the target model.
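Below is a minimal sketch in Python of the connect-the-dots idea: job order is implied by matching the writers and readers of the same sequential files, and a simple walk upstream reconstructs the derivation path. The job names and file names are hypothetical process metadata, not the output of any particular ETL tool.

from collections import defaultdict

# Hypothetical process metadata: which sequential files each job reads and writes.
jobs = {
    "Src_ABC_Parts":          {"reads": [],                       "writes": ["Src_ABC_Parts.dat"]},
    "Xfm_Dim_Parts_Products": {"reads": ["Src_ABC_Parts.dat"],    "writes": ["Xfm_Dim_Products.dat"]},
    "Load_Dim_EDW_Products":  {"reads": ["Xfm_Dim_Products.dat"], "writes": []},
}

# A job that reads a file is implied to follow the job that wrote it.
writers = defaultdict(list)
for name, meta in jobs.items():
    for f in meta["writes"]:
        writers[f].append(name)

predecessors = {name: sorted({w for f in meta["reads"] for w in writers[f]})
                for name, meta in jobs.items()}

def upstream(job, seen=None):
    """Walk upstream from a target-loading job to trace the derivation path."""
    seen = seen or []
    for p in predecessors[job]:
        if p not in seen:
            seen.append(p)
            upstream(p, seen)
    return seen

print(upstream("Load_Dim_EDW_Products"))
# ['Xfm_Dim_Parts_Products', 'Src_ABC_Parts']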

4 The required ETL attributes in the Data Warehouse


The previous discussion of surrogate keys shows a few columns that need to be added to the data warehouse model for the purpose of determining surrogate key assignment. Listed below are all the columns needed for ETL best practices in the target DW and their purpose:

Column Name: Purpose

Source_System_Code: Identifies which source system inserted/updated this row LAST (depends on table type 1, 2, or 3). Used for surrogate key assignment and data lineage.

Source_System_Key: Identifies the natural source system key. Used for surrogate key assignment and data lineage.

Original_Batch_Nbr: Optional original batch identifier, populated only on insert of the new record. Used for surrogate key assignment and data lineage.

Batch_Nbr: Batch identifier populated at insert and updated each time a row is updated. Used for surrogate key assignment and data lineage.

Version_Nbr: Counter incremented for each update for type 1 dimensions and updateable facts. More importantly, for SCDs and atomic-level facts this column provides the simplest method of ordering the versions of the same source system key. It is the only way to insure that all versions for the same source system key have been included in a query, i.e. a contiguous set of version numbers. Used for surrogate key assignment and data lineage.

5 The staging database (Sandbox)


In an ETL process, an ETL-specific staging database can be a valuable aid in developing complex transformations. A staging database allows the ETL process to play what-if with incoming data, thus the term sandbox applies. This means that the data can be transformed and test-inserted or updated into a staged copy of the target table. This becomes especially powerful if many rows in the incoming data stream are modifying either an existing row in the target table, or a row that would be inserted during this processing.

To design the sandbox, simply take the DDL for the target database and create the sandbox. The indexing strategy on the sandbox becomes quite simple, in that only the columns needed for ETL transformation processing have to be addressed. In addition, foreign key constraints can be eliminated, as the ETL process will enforce foreign key constraints via the surrogate key substitution. Every ETL cycle the sandbox will be truncated and reset to the current contents of the target database. Volume management techniques will be applied in the process that refreshes the sandbox. In the sourcing phase of ETL processing, the source system codes and source system keys on the incoming rows will be captured and put into temporary tables in the target database. These temporary tables will then be used in an inner join to return only those rows in the target database that are pertinent to the incoming processing. This technique is covered more thoroughly in the next section.

The sandbox also has one additional benefit. During ETL architecture meetings, exceptions and error handling become a critical issue. The sandbox allows a simple interface to the data that is proposed to go on to the target database. The IT support staff have a simple interface into the data for manual correction of issues and for research into those issues. This can be a more reliable method for manually correcting transformed data than editing sequential load files. Likewise, the sandbox can have audit scripts run against the data to insure validity and meet quality controls. These audit scripts can be simplified down to the level of SQL scripts, as in the sketch below.
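Here is a minimal sketch of such audit scripts, run as plain SQL against the sandbox before the load phase; an in-memory SQLite database stands in for the sandbox, and the tables, columns, checks, and tolerance are hypothetical illustrations rather than prescribed audits.

import sqlite3

sandbox = sqlite3.connect(":memory:")  # stand-in for the sandbox database
sandbox.executescript("""
    CREATE TABLE stgParty  (Party_ID INTEGER, Party_Name TEXT, Batch_Nbr INTEGER);
    CREATE TABLE stgOrders (Order_ID INTEGER, Party_ID INTEGER, Batch_Nbr INTEGER);
    INSERT INTO stgParty  VALUES (1, 'Acme', 501);
    INSERT INTO stgOrders VALUES (100, 1, 501), (101, 2, 501);
""")

# Audit 1: orders whose foreign key does not resolve to a sandbox party row.
orphans = sandbox.execute("""
    SELECT COUNT(*) FROM stgOrders o
    LEFT JOIN stgParty p ON p.Party_ID = o.Party_ID
    WHERE p.Party_ID IS NULL""").fetchone()[0]

# Audit 2: rows that would carry nulls into a not-null warehouse column.
null_names = sandbox.execute(
    "SELECT COUNT(*) FROM stgParty WHERE Party_Name IS NULL").fetchone()[0]

TOLERANCE = 0  # exceptions allowed before the master script declares a no-go
go = orphans <= TOLERANCE and null_names <= TOLERANCE
print(f"orphans={orphans}, null_names={null_names}, decision={'go' if go else 'no-go'}")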

6 Null columns
A lot of hubbub is raised throughout warehouse engineering regarding null columns. There are a couple of schools of thought on null columns. The question raised is actually a simple one: do you allow columns in a data warehouse to contain null values, or do you populate the columns with a representative value?

First, some background on nulls and how databases and query tools treat them. A null is a value that has an indeterminate state, which means that the value cannot be accurately described. Therefore, any operation on a null value will always return a null. For example, adding a value to a null always produces a null. It would follow that concatenating a value to a null will also produce a null. This is why SQL has a separate function for checking whether a value is null. Any comparison to a null value always produces a null. Query tools tend to be very smart when working with nulls. Many tools will treat a null as a value to disregard when doing mathematical operations. The ramifications of nulls play into how the tools count rows of resultant data. For example, if you have 100 rows of data with a sales amount column and you sum the total sales amount column, the null rows are not treated as zero but are actually ignored! This is apparent when you divide the summed total sales amount by the row count for the average sales dollar amount: the rows that had null sales amounts are ignored, and therefore the ratio only reflects those rows that were not null. Query and reporting tools also insulate the end user from nulls when concatenating text.

The most common solution in data warehouses is to put a representative value into the warehouse for the null condition. Some example columns and their recommended values are:
Null Date:       1800-01-01
Null Start Date: batch processing date
Null End Date:   2999-01-01
Numeric:         0
Text:            a single space

Nulls in a warehouse table are then identifiable as rows that were not correctly transformed or loaded. During transformation, if a calculation went awry, or null data made it through a data cleansing process, then this value makes it all the way into the warehouse. Another case is when the load-ready data does not match the column-level characteristics, and the database conveniently nulls the data for you. This is one reason why warehouse modelers set all columns in a warehouse to not allow nulls. When the emphasis is on loading all of the data, no matter the state, then columns are set to allow nulls and after-the-fact processes check the warehouse tables for nulls.
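The following is a minimal sketch of substituting the representative values above during transformation; the column names and row layout are hypothetical, and the specific defaults simply follow the recommendations in the table.

from datetime import date

# Representative values for null columns, per the table above.
NULL_DATE     = date(1800, 1, 1)
NULL_END_DATE = date(2999, 1, 1)
NULL_NUMERIC  = 0
NULL_TEXT     = " "

def conform_nulls(row, batch_processing_date):
    """Replace nulls with representative values before the row is loaded."""
    return {
        "create_dt": row.get("create_dt") or NULL_DATE,
        "start_dt":  row.get("start_dt")  or batch_processing_date,   # null start date
        "end_dt":    row.get("end_dt")    or NULL_END_DATE,
        "amount":    row.get("amount") if row.get("amount") is not None else NULL_NUMERIC,
        "name":      row.get("name")      or NULL_TEXT,
    }

print(conform_nulls({"amount": None, "name": None}, date(2024, 1, 15)))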

7 Inserted or updated row identification


As in the previous section's examples, you might have noticed a column named Warehouse:Party.Batch_Nbr. When transactional system modelers first start to model data warehouses, methodologies tend to carry over into the warehouse. One of the hardest habits to break is the notion that every table in the warehouse has a column that will track, to the millisecond, the time the row was inserted or last updated. These columns always vary around the name Last_Update_Date. The Last_Update_Date column is not needed in a data warehouse! Reasons to avoid a date or timestamp:

- A dynamic call to the system clock to determine to the millisecond when a row is being put into a warehouse is an unnecessary burden.
- A date or timestamp cannot be used to absolutely determine which rows were affected by a warehouse refresh cycle. Just using a date disallows multiple loads in the same day. A timestamp only approaches usefulness if it's captured at refresh cycle initiation and passed to all processes, thereby associating the rows.
- A date or timestamp does not have clarity as to the nature of the refresh cycle, nor does it contain other useful information.

A better design is to include a column called Batch_Nbr in every table. Since a warehouse refresh cycle is a batch process, it can be assigned the next available sequential id. The ETL architect will design tables that support the ETL processing. The most important table is the one that tracks all refresh processes. At a minimum, the following table should be added to the warehouse model:

ETL_Batch_Audit: Batch_Nbr <PK>, Processing_Date, Start_Timestamp, End_Timestamp, Status, Batch_Type, User_ID, Comments

At initiation of the warehouse refresh cycle, the ETL_Batch_Audit table is queried for the maximum Batch_Nbr, and one is added to this value. This becomes the batch number for the ensuing processing. The row can be updated during the processing to reflect the current state of the batch process. The addition of another column, Original_Batch_Nbr, allows the batch that inserted the row to be recorded. This provides an origination audit trail for the row, where Batch_Nbr provides the most recent update audit trail. A sketch of this batch bookkeeping follows.
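This is a minimal sketch of opening and closing a batch in Python against the ETL_Batch_Audit table described above, with an in-memory SQLite database standing in for the warehouse audit schema; the data types and the function names are illustrative assumptions.

import sqlite3
from datetime import datetime, date

db = sqlite3.connect(":memory:")  # stand-in for the warehouse / audit schema
db.execute("""CREATE TABLE ETL_Batch_Audit (
    Batch_Nbr INTEGER PRIMARY KEY, Processing_Date TEXT,
    Start_Timestamp TEXT, End_Timestamp TEXT,
    Status TEXT, Batch_Type TEXT, User_ID TEXT, Comments TEXT)""")

def open_batch(batch_type="Automatic", user_id="etl_owner"):
    """Assign the next Batch_Nbr and record the start of the refresh cycle."""
    nbr = db.execute("SELECT COALESCE(MAX(Batch_Nbr), 0) + 1 FROM ETL_Batch_Audit").fetchone()[0]
    db.execute("INSERT INTO ETL_Batch_Audit VALUES (?,?,?,?,?,?,?,?)",
               (nbr, date.today().isoformat(), datetime.now().isoformat(),
                None, "Running", batch_type, user_id, None))
    return nbr

def close_batch(nbr, status="Success", comments=None):
    """Update the batch row to reflect the final state of the refresh cycle."""
    db.execute("""UPDATE ETL_Batch_Audit
                  SET End_Timestamp = ?, Status = ?, Comments = ? WHERE Batch_Nbr = ?""",
               (datetime.now().isoformat(), status, comments, nbr))

batch_nbr = open_batch()   # every process in this refresh cycle carries batch_nbr
close_batch(batch_nbr)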

8 Process Audit
The ETL_Batch_Audit table becomes the starting point for all of the process auditing. It is extremely important to architect an ETL solution that is easily self-documenting, as well as consistent in design throughout. The following tables will assist in the efforts of process auditing:
ETL_Batch_Audit
  Batch_Nbr: INTEGER
  Processing_Date: DATE
  Start_Timestamp: DATE
  End_Timestamp: DATE
  Status: VARCHAR(20)
  Batch_Type: VARCHAR(20)
  User_ID: VARCHAR2(20)
  Comments: LONG VARCHAR

ETL_Process_Audit
  Batch_Nbr: INTEGER
  Process_Name: VARCHAR(40)
  Start_Timestamp: DATE
  End_Timestamp: DATE
  Reject_Count: INTEGER
  Status: VARCHAR(20)
  Comments: LONG VARCHAR

ETL_Links_Audit
  Batch_Nbr: INTEGER
  Process_Name: VARCHAR(40)
  Link_Name: VARCHAR(40)
  Row_Count: NUMBER

ETL_Table_Audit
  Batch_Nbr: INTEGER
  Process_Name: VARCHAR(40)
  Target_Name: VARCHAR(40)
  Insert_Count: INTEGER
  Update_Count: INTEGER
  Reject_Count: INTEGER

ETL_Batch_Audit:
  Batch_Nbr <PK>: A simple sequentially assigned integer.
  Processing_Date: The date that all processes will use instead of the system clock date. This prevents a midnight rollover issue.
  Start_Timestamp: The physical system clock timestamp when the warehouse refresh cycle was initiated.
  End_Timestamp: The physical system clock timestamp when the warehouse refresh cycle finished.
  Status: The status of the batch: Running, Failed, Success.
  Batch_Type: The type of the batch: Automatic, Restart, Manual.
  User_ID: The user id of the owner of the warehouse refresh process.
  Comments: Any commentary automatically generated or user entered that describes the warehouse refresh process.

ETL_Process_Audit:
  Batch_Nbr <PK>: The batch number assigned for this warehouse refresh process.
  Process_Name <PK>: The name of the code module.
  Start_Timestamp: The physical system clock timestamp when the code module was initiated.
  End_Timestamp: The physical system clock timestamp when the code module finished.
  Reject_Count: The count of any rows of data that were pulled out at this code module.
  Status: The status of the code module: Running, Failed, Success.
  Comments: Any commentary automatically generated or user entered that describes the code module.

ETL_Links_Audit:
  Batch_Nbr <PK>: The batch number assigned for this warehouse refresh process.
  Process_Name <PK>: The name of the code module.
  Link_Name <PK>: The name assigned to a particular path segment (link) in the code module.
  Row_Count: The count of rows that flowed through this path segment.

ETL_Table_Audit:
  Batch_Nbr <PK>: The batch number assigned for this warehouse refresh process.
  Process_Name <PK>: The name of the code module.
  Target_Name <PK>: The table or file affected by this code module.
  Insert_Count: The number of rows that are inserted into this target.
  Update_Count: The number of rows that are updated in this target.
  Reject_Count: The number of rows that were rejected by this target.

9 Changed data determination


Changed data determination is different than changed data capture. Changed data capture takes place within the framework of a source system's software and hardware, whereas changed data determination occurs in the warehouse ETL process. Changed data capture is when software or hardware means are used to identify rows of data that have undergone a change since the last time the changed data capture process ran. This is sometimes done using the database transaction logs, and other times through software modifications that log the changes for interested tables to a tracking table. Changed data determination is the identification of different rows once the data has been handed to the ETL process for processing into the warehouse. The point of changed data capture and determination is to reduce the volume of data processing into the data warehouse. Volume reduction directly translates into smaller refresh runtimes. Changed data determination can be accomplished in many ways, and it is up to the ETL architect to determine which technique is appropriate for the given situation.

The simplest method for extracting changed data is to source only those rows from the source systems that have a timestamp reflecting the last change to that row. This allows the ETL process to keep track of the last valid refresh cycle and selectively query those rows that show a more recent change. This is the first and best solution for reduction of source volume.
o The obvious drawback to this method is that every source table must have a last-update timestamp.
o However, the most insidious problem in determining changed data is the following: because a reference table may have a change, and the change is only apparent after the reference lookup occurs, a full table extract may still have to occur if the change can only be seen post-transformation. For example, a customer table is mapped into a warehouse party table, and a code value is referenced from another table, retrieving another value during the transformation processing for inclusion in the target table. All source customers have to run through the transformation processing to see if their code value returns a different value on reference, because the change may have occurred in the reference table. If analysis is not thorough before choosing this technique, hidden changed data may not be processed!

Another method for doing changed data determination is again at the sourcing phase of processing. This method could be called differentiation. Given that the sourcing phase produces a sequential file, it is possible to compare the contents of the file created in the current process to that of the previous process. By doing a bit-wise comparison of the current rows to the previous rows, the incoming data can be reduced to only those rows that actually are different. This method works extremely well for situations where the source table is being completely scanned because there is no ability to selectively extract only those rows that are different. Very simple code can be used to differentiate the two file sets and produce the third set that moves on for transformation (a sketch follows below). This technique has the same consideration as the last technique in that the change may only be apparent after transformation. Both methods will be employed according to the nature of the source data.

The last technique is similar to the previous technique, except that it is employed after transformation. After all transformation has completed, the resulting sequential load file can be compared to the current contents of the warehouse. This process may be more complex than anticipated, in that multiple rows within the incoming data may affect the same target row. In this case, the differentiation must occur during transformation, as opposed to after all transformation is completed. This causes the transformation process to be much slower and more complex, as each row must be committed after modification so that the subsequent row can compare its results against this recently modified row.
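Here is a minimal sketch of the differentiation technique in Python: compare today's sourced sequential file to the archived file from the previous run and pass only new or changed rows on to transformation. It uses a key-based row-image comparison as a simple variant of the bit-wise comparison described above; the file names and the position of the key column are hypothetical.

import csv

def differentiate(current_path, previous_path, delta_path, key_cols=(0,)):
    """Write to delta_path only the rows of the current extract that are new or
    changed relative to the previous extract (a row-image comparison)."""
    with open(previous_path, newline="") as f:
        previous = {tuple(r[i] for i in key_cols): r for r in csv.reader(f, delimiter="|")}
    with open(current_path, newline="") as cur, open(delta_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        for row in csv.reader(cur, delimiter="|"):
            key = tuple(row[i] for i in key_cols)
            if previous.get(key) != row:     # new key, or same key with a different image
                writer.writerow(row)

# Illustration with two small extracts of the same source table.
with open("prev.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows([["1000", "Acme", "A"], ["2000", "Zenith", "A"]])
with open("curr.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows([["1000", "Acme Corp", "A"], ["2000", "Zenith", "A"],
                                            ["3000", "Globex", "A"]])
differentiate("curr.dat", "prev.dat", "delta.dat")
print(open("delta.dat").read())   # rows 1000 (changed) and 3000 (new)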

10 Process control parallel execution


The opportunities for parallel execution of processes should now be readily apparent because of the process banding approach discussed earlier. The five main parts of ETL can execute under the following parallel strategy:

1. Source: parallel across all sourcing processes. These processes are typically selecting data from a source system. That selection query is already optimized for parallelism on the source server if that database supports parallelism. The processes are not interdependent, except when some of the selection queries are joining to a previous selection set.
2. Lookup: parallel across all lookup processes. The lookup processes are building a snapshot of the target database in the sandbox. This is mostly pulling the full contents of some tables, and subsets of others, into the sandbox. These processes are highly independent, as foreign key constraints are not needed within the sandbox.
3. Transform: serial where foreign key dependencies appear, parallel where no foreign keys interfere.
4. Recovery: parallel across all recovery processes.
5. Load: parallel if foreign key constraints are removed, otherwise the same as the Transform strategy.

11 Process recoverability and restart


The process segmentation facilitates process recoverability at each of the segment transitions. Because the source, lookup, recovery, and load segments have sequential files as results, the process can be restarted easily at these points. If the process stops during transformation processing, then recoverability becomes more difficult, as the sandbox may have modified rows. The worst case is that transformation processing begins again by reloading the lookup load files into the sandbox and trying again. The following chart lists potential problems and their restart or recovery points:

Problem / Recoverable Point
1. Unable to select all source tables, access denied: restart from sourcing.
2. Unable to connect to target database: restart from lookup beginning.
3. Target database out of extent space during bucket table loading: restart from lookup beginning.
4. File system out of disk space during lookup unloading: restart from lookup unloading.
5. Sandbox out of extent space during lookup loading: restart from lookup loading.
6. Transformation processing runs out of disk/extent space: determine if transformation is restart-capable from the failure point, otherwise restart from lookup loading.
7. Database recoverability process runs out of disk space during recoverability file creation: restart from recoverability file creation.
8. Load processing runs out of disk space while creating load files: restart from load file creation.
9. Load processing runs out of extent space while inserting/updating target database: restart from load insert/update processing.

As demonstrated, the entire ETL process has been designed to have explicit recovery points throughout. These restart points allow recovery procedures to be well documented and easily programmed. In the event of a problem, the operations staff has predefined resolutions for most problems. The ETL programming staff can have a restart script for each type of problem, and after resolution the operations staff has an easy time restarting the process. Because each process segment prepares the environment for its portion of the processing, restart is absolutely clean and problem free.

12 Archival
Each process segment produces sequential files that feed the next processing segment. In addition, exception files, log files, and other miscellaneous files are created throughout the entire process. These files are highly compressible and should be stored for a predetermined period of time, usually at least a week if the warehouse has a daily refresh cycle, and a month for a weekly refresh cycle. The purpose of the archival is to allow a problem to be researched if it is noticed within the purge window. The archived files can be moved into the development or test environments in order to recreate the process.

Another advantage to archiving after each ETL process run is to provide an image to the next process of what the previous process looked like. This is important when a sourcing process is retrieving the entire contents of a table and is unable to select only those rows that have changed since the last ETL process. After capturing the full state of the table to a sequential file, that file can be compared to the previous run's sequential file, and the new or different rows pulled out and placed into another sequential file. This file then becomes the delta file for the ETL transformation process.

Another advantage to the archiving principle comes into play during a database recovery. In the event that the data warehouse database needs to be recovered, the database is restored from backup tape. Then, the system administrators roll transaction logging forward to bring the database back up to the current state. But in a data warehouse environment, transaction logging is unnecessary. The ETL process contains the load files in a load-ready format. If the database is recovered from tape, then the load files for each day simply need to be applied to the database in chronological order to bring the database back up to the current state.

13 Database recoverability
Data warehouses are extremely large databases and often play a critical role in a company's decision-making processes. Therefore, the recovery strategy becomes very important in maintaining optimal uptime for the data warehouse. As mentioned in the previous section, the archiving strategy for the load set allows an easy recovery of the database after a restore from tape. This is an excellent choice given the time to restore the database and roll forward the load sets. An alternative exists that is quicker and cleaner and does not require the database to be restored, thus avoiding a long downtime for the data warehouse. Several critical pieces play into this technique, but they are easy to implement. The database must have a batch number column in every table, as discussed earlier. Since the ETL process produces an insert and an update sequential file for each table, it becomes easy to return the database to its prior state. The rows that were inserted can simply be selected by the batch number column and deleted. The updated rows become a little trickier, because each row has to be un-updated. In the lookup process, a technique is shown in which the data warehouse is queried and a subset of the data needed for the ETL process is copied into an ETL sandbox. This process captures the before image of the rows targeted for update. Since the process unloads the rows to sequential files, these files become the recovery set for returning the database back to its original state after the inserted rows have been deleted. A sketch of this back-out follows.
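Below is a minimal sketch of the back-out in Python, with an in-memory SQLite database as a stand-in for the target warehouse; the table, columns, and before-image file name are hypothetical. It deletes the rows the batch inserted (identified by Original_Batch_Nbr, which is populated only on insert) and re-applies the before images captured during the lookup phase to un-update the changed rows.

import csv
import sqlite3

dw = sqlite3.connect(":memory:")  # stand-in for the target warehouse
dw.execute("""CREATE TABLE Party (Party_ID INTEGER PRIMARY KEY, Party_Name TEXT,
                                  Original_Batch_Nbr INTEGER, Batch_Nbr INTEGER)""")
dw.executemany("INSERT INTO Party VALUES (?,?,?,?)",
               [(1, "Acme Corp", 400, 502),   # updated by batch 502
                (2, "Globex",    502, 502)])  # inserted by batch 502

# Before image of rows targeted for update, captured by the lookup phase.
with open("Rcv_EDW_Party.dat", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerow([1, "Acme", 400, 501])

def back_out(batch_nbr, before_image_path):
    # 1. Delete the rows this batch inserted.
    dw.execute("DELETE FROM Party WHERE Original_Batch_Nbr = ?", (batch_nbr,))
    # 2. Re-apply the before image to un-update the rows this batch changed.
    with open(before_image_path, newline="") as f:
        for pid, name, orig, batch in csv.reader(f, delimiter="|"):
            dw.execute("""UPDATE Party SET Party_Name = ?, Original_Batch_Nbr = ?, Batch_Nbr = ?
                          WHERE Party_ID = ?""", (name, orig, batch, pid))
    dw.commit()

back_out(502, "Rcv_EDW_Party.dat")
print(dw.execute("SELECT * FROM Party").fetchall())  # row 1 restored; row 2 removed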

14 Naming conventions
A typical warehouse integrating 100 source tables into a target star schema warehouse composed of 20 dimensions and five fact tables is going to use about 500 ETL jobs or individual code pieces. Each one of those pieces or jobs needs a meaningful name. A well thought out naming convention can yield some much-needed functionality during impact analysis time. These names could be put together in such a way that the job intent is realized just by looking at the name. For example, the sourcing process can be documented using a naming convention where the job or code piece is named for the tables being sourced as well as the source system where the table resides. This can be extended throughout the entire process. For example:
Function                  Prefix  Prefix Extension  Source System or DB Identifier  Source Table  Target Table  Example
Sourcing                  Src     -                 ABC                             Parts         -             Src_ABC_Parts
Lookup buckets            Lkp     Tmp               EDW                             Products      -             Lkp_Tmp_EDW_Products
Lookup staging            Lkp     Bld               EDW                             Products      -             Lkp_Bld_EDW_Products
Transformation dimension  Xfm     Dim               EDW                             Parts         Products      Xfm_Dim_Parts_Products
Transformation fact       Xfm     Fact              EDW                             Orders        OrderDetails  Xfm_Fact_Orders_OrderDetails
Load dimension            Load    Dim               EDW                             -             Products      Load_Dim_EDW_Products
Load fact                 Load    Fact              EDW                             -             OrderDetails  Load_Fact_EDW_OrderDetails

This naming convention style lends itself to impact analysis capability. The naming style also sorts itself alphabetically, grouping like-purpose jobs or code pieces together. The name is easily parsed to determine the nature of what is going on inside the code. A boilerplate approach to the jobs or code pieces means that a code walk-thru is almost meaningless, in that all jobs or code pieces have the same functionality, just different mappings and metadata.
