Surrogate Keys

Lecture #5: Surrogate Keys

Surrogate means an artificial or synthetic product that is used as a substitute for a natural product. A natural product is generally costlier than the surrogate product, and this forces us to use the latter to save money. In a data warehousing environment too, we cannot afford to use the natural keys, as they are expensive in terms of the space they occupy. Moreover, surrogate keys have other advantages as well. In this lecture we take a closer look at surrogate keys.

Natural keys go by several other names: production keys, smart keys, and intelligent keys. They are called so because they have some information about the record they represent embedded in them. Surrogate keys also have various aliases: meaningless keys, integer keys, non-natural keys, and artificial keys.

It is strongly recommended to use surrogate keys in dimensional models rather than relying on the operational production keys. As designers of operational systems, we have been trained to incorporate as much information as possible into the keys (making keys out of the given data). Surrogate keys, by contrast, are integers that are assigned sequentially, as needed, to populate a dimension. For example, the first product record is assigned a product surrogate key with the value 1, the next product record is assigned the value 2, and so on. We use 4 bytes for a surrogate key (is it sufficient?). Surrogate keys are used merely to join the dimension tables to the fact tables.

Advantages of Surrogate Keys:
1. Buffer the data warehouse from operational changes
2. Save space
3. Provide performance advantages
4. Enable efficient handling of changes to dimension tables

We will look at these advantages in detail later. First let us understand how surrogate keys are generated before data is loaded into the data warehouse.

Keys for the Dimension Tables

All the records in all the dimension tables must be assigned a surrogate key. This task of assigning surrogate keys can be divided into two modules:
1. First-time loading of a dimension table
2. All subsequent loads
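On the question raised above of whether 4 bytes are sufficient for a surrogate key, one way to reason about it is to compute how many distinct keys fit. The short sketch below assumes the key is stored as a signed 4-byte integer and that keys start at 1; both are assumptions, since the lecture does not state the storage type.

```python
# Assumption: the surrogate key is a signed 4-byte integer and keys start at 1.
KEY_BYTES = 4

# A signed integer of n bits can hold values up to 2**(n - 1) - 1.
max_key = 2 ** (8 * KEY_BYTES - 1) - 1

print(max_key)  # 2147483647
```

Under these assumptions a single dimension can hold a little over 2.1 billion records (including the extra records created by changes) before the key range is exhausted.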
Dr. Navneet Goyal, BITS, Pilani, Data Warehousing
Dimension tables, as we all know, have single-field primary keys, which are assigned to them in the operational systems: for example, product ID, customer ID, etc. Suppose that we want to bring in the product master table. Here we assume that the product data is clean and that there is only one record per product (SKU); that is, de-duplication of data has already been done in the data staging area (DSA). For each record of the master table, simply assign surrogate keys (integers starting from 1) sequentially. This simple process is just a sequential read of the incoming data.

Production Key    Surrogate Key
Prod1             1
Prod2             2
Prod3             3
and so on
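The first-time load described above can be sketched in a few lines. This is only an illustration: the function name first_load and the use of an in-memory dictionary as the lookup table are assumptions, and the input is assumed to be already de-duplicated.

```python
def first_load(production_keys):
    """Assign surrogate keys 1, 2, 3, ... in the order the records are read.

    Assumes the incoming production keys are already de-duplicated.
    """
    lookup = {}
    for surrogate_key, production_key in enumerate(production_keys, start=1):
        lookup[production_key] = surrogate_key
    return lookup


lookup = first_load(["Prod1", "Prod2", "Prod3"])
print(lookup)  # {'Prod1': 1, 'Prod2': 2, 'Prod3': 3}
```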
Any subsequent load of product data is not as simple as the first load. All the production keys must be checked to see if they have been encountered before. This can be done by comparing the incoming production key with the production key that is stored in the product dimension as an ordinary field. However, keeping a separate lookup table that contains the mapping between the production keys and the surrogate keys is recommended. This lookup table can be suitably indexed to speed up the lookup process.

If the incoming key is not in the lookup table, simply assign it a new surrogate key and update the lookup table. If you have seen the production key before and all the attribute values are the same, simply ignore the record. If some attribute values have changed, assign a new surrogate key to this production key. This is done by creating a new dimension record with the same natural key but a new surrogate key (Type II change*). We can also overwrite the existing record and retain the same surrogate key (Type I change*).

The second and any subsequent load of the dimension table can be understood more clearly through the following example. Suppose the lookup table after the first load looks like this:

Production Key    Surrogate Key
Prod1             1
Prod2             2
Prod3             3
Prod4             4
Prod5             5
Now if a product record with production key Prod6 comes, simply assign a new surrogate key to it. This leads to the addition of a new record (Prod6, 6) to the lookup table. If a product record with production key Prod1 comes, we have to compare all its attribute values with the values in the dimension table. If all the values match, simply ignore this record. Finally, if a product record with production key Prod1 comes and one or more attribute values have changed, assign a new surrogate key (7) to it and update the surrogate key field in the lookup table to 7. A new record also needs to be created in the dimension table. This new record will have the surrogate key 7 and production key Prod1. The new lookup table would look like this:

Production Key    Surrogate Key
Prod1             7
Prod2             2
Prod3             3
Prod4             4
Prod5             5
Prod6             6

* Refer to lecture #6 on Slowly Changing Dimensions.
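The subsequent-load logic walked through above (new key, unchanged record, changed record handled as a Type II change) can be sketched as follows. The function name load_record, the dictionary-based lookup and dimension tables, and the made-up "name" attribute are all assumptions for illustration; a real load would work against indexed database tables.

```python
def load_record(production_key, attributes, lookup, dimension):
    """Process one incoming record in a subsequent load (Type II handling).

    lookup:    production key -> latest surrogate key
    dimension: surrogate key  -> (production key, attributes)
    Returns the surrogate key that now represents the production key.
    """
    current = lookup.get(production_key)
    if current is not None and dimension[current][1] == attributes:
        return current  # seen before and nothing changed: ignore the record
    # New production key, or changed attributes: create a new dimension record.
    new_key = max(dimension, default=0) + 1
    dimension[new_key] = (production_key, attributes)
    lookup[production_key] = new_key
    return new_key


# State after the first load; the "name" attribute values are made up.
dimension = {i: (f"Prod{i}", {"name": f"item {i}"}) for i in range(1, 6)}
lookup = {prod: key for key, (prod, _) in dimension.items()}

load_record("Prod6", {"name": "item 6"}, lookup, dimension)     # new key 6
load_record("Prod1", {"name": "item 1"}, lookup, dimension)     # unchanged
load_record("Prod1", {"name": "item 1 v2"}, lookup, dimension)  # changed, key 7

print(lookup)
# {'Prod1': 7, 'Prod2': 2, 'Prod3': 3, 'Prod4': 4, 'Prod5': 5, 'Prod6': 6}
```

Note that the old dimension record with surrogate key 1 is kept, so the dimension now holds two records for Prod1, while the lookup table points only to the latest one.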
Note: Can we implement a Type II change without surrogate keys?

If the changes in the production data are marked and timestamped, then there is no need to perform the expensive field-by-field comparison.

Keys for the Fact Tables

Once the dimension tables and their corresponding lookup tables are in place, we can start loading records into the fact table. Processing fact table records is simple. Take the production keys in each fact record one by one and look up the corresponding surrogate key value in the lookup table. Replace the production key with the surrogate key and you are done (note that the lookup table contains only the latest surrogate key). Consider the following sales record:

(location1, prod1, time3, units, amount)

This record will enter the sales fact table as:

(1, 6, 3, units, amount)

where 1 is the surrogate key for location1, 6 is the surrogate key for prod1, and 3 is the surrogate key for time3. The facts units and amount remain unchanged.

Now that we have understood how surrogate keys are assigned, let us look at the advantages of using them.

Buffering the Data Warehouse from Operational Changes

In many organizations, obsolete or inactive production codes are reused. For example, a bank may reuse an account number if it has been inactive for a long period of time or if the account has been closed by the customer. Similarly, obsolete product codes could be reused by a grocery store. Such changes do not affect the operational systems, as operational systems do not keep historical data. A data warehouse, on the other hand, retains data for years, and reuse of codes can cause a conflict. Use of surrogate keys lets us create a new record in the dimension table for the reused code, with its own surrogate key; this record will have different values in most of the attributes.
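The fact-table key substitution and the buffering effect described above can be sketched together. The lookup values follow the sales-record example (location1 to 1, prod1 to 6, time3 to 3); the function name to_fact_row, the units and amount figures, and the new surrogate key 8 assigned on reuse are hypothetical values chosen only for illustration.

```python
# Hypothetical lookup tables: production key -> latest surrogate key.
lookups = {
    "location": {"location1": 1},
    "product": {"prod1": 6},
    "time": {"time3": 3},
}


def to_fact_row(record, lookups):
    """Replace the production keys of a sales record with surrogate keys."""
    location, product, time, units, amount = record
    return (
        lookups["location"][location],
        lookups["product"][product],
        lookups["time"][time],
        units,    # facts are copied unchanged
        amount,
    )


fact_table = [to_fact_row(("location1", "prod1", "time3", 10, 250.0), lookups)]

# The operational system later reuses the code prod1 for a different product.
# The warehouse gives it a new surrogate key (8 is an arbitrary unused value).
lookups["product"]["prod1"] = 8
fact_table.append(to_fact_row(("location1", "prod1", "time3", 4, 90.0), lookups))

print(fact_table)  # [(1, 6, 3, 10, 250.0), (1, 8, 3, 4, 90.0)]
```

The earlier fact row keeps pointing at surrogate key 6, so the history of the old product is not mixed up with the product that now carries the same production code.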
Space Saving

As surrogate keys are only 4 bytes long, they occupy less space than the bulky alphanumeric production keys. For example, the date data type occupies 8 bytes, so by using a surrogate key in its place we save 4 bytes per record. If we have 1 billion records in the fact table, we would be saving 4 x 1 billion bytes = 3.73 GB of space (taking 1 GB = 2^30 bytes).

Performance Advantages

This is mainly a consequence of the above advantage. Use of surrogate keys leads to smaller fact tables and smaller index files for fact tables. This in turn means that more fact table records fit in one block, which facilitates their processing.

Efficient Handling of Changing Dimensions

As we have already seen, the use of surrogate keys allows us to implement Type II changes, which would not have been possible using just production keys.

The above material has been compiled from the following sources. The idea is to put forth the concepts in a simple way.
1. Kimball, Ralph, Surrogate Keys, DBMS Online, May 1998, www.dbmsmag.com/9805d05.html
2. Kimball, Ralph, Pipelining Your Surrogates, DBMS Online, June 1998
3. Kimball, Ralph and Ross, Margy, The Data Warehouse Toolkit, 2e, John Wiley, 2002
4. Carter, Breck, Intelligent vs. Surrogate Keys, www.bcarter.com/intsurr1.html