
Change Data Capture using MD5

The purpose of this document is to give a brief overview of implementing change data capture (CDC) using the MD5 function.

Table of Contents
Introduction
What does MD5 Do?
  Syntax
  Return Value
SQL Override
Expression
  exp_INITIAL
  ACTION_CODE
  ETL_MD5_CHECKSUM
ROUTER
Explanation
Conclusion

Introduction:
To write only changed data to a database, use MD5 to generate checksum values for the rows of data you read from a source. When you run a session, compare the previously generated checksum values against the new checksum values, then write the rows whose checksum values have changed to the target. A changed checksum value indicates that the data has changed.

What does MD5 Do?


Calculates the checksum of the input value. The function uses Message-Digest algorithm 5 (MD5). MD5 is a one-way cryptographic hash function with a 128-bit hash value. You can conclude that input values are different when the checksums of the input values are different. Use MD5 to verify data integrity.

Syntax
MD5( value )

Argument: value
Required/Optional: Required
Description: Value for which you want to calculate the checksum. The case of the input value affects the return value. For example, MD5(informatica) and MD5(Informatica) return different values.
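The behaviour described above can be tried outside Informatica. The following is a minimal Python sketch, not Informatica code: the helper name md5_hex and the UTF-8 encoding are assumptions, so the digests may not match Informatica's output byte for byte if the session uses a different code page.

```python
import hashlib

def md5_hex(value):
    """Return the MD5 checksum of a string as 32 hex characters, or None for None input."""
    if value is None:
        # Mirrors the documented rule: NULL in, NULL out.
        return None
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

lower = md5_hex("informatica")
upper = md5_hex("Informatica")

print(len(lower))        # length of the hex digest
print(lower != upper)    # case of the input changes the checksum
```

The sketch only illustrates the properties listed in this section: a 32-character hexadecimal result, case sensitivity, and a null result for null input.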

Return Value
Unique 32-character string of hexadecimal digits 0-9 and a-f. NULL if the input is a null value.

[Figure: mapping flow used to capture the CDC data]

Let's understand how MD5 works when we don't have a date column to capture the changed data. Here we have two sources: one is the main source and the other is the target, read back as a landing source (_L). Both have the same structure.

We then do a full outer join on the two sources based on the PK columns. In the expression, we decide the ACTION_CODE (Insert/Update/Delete).
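As an illustration of the join step, here is a small Python sketch of a full outer join on the key. The function name full_outer_join, the column names KEY and NAME, and the sample rows are hypothetical; in the mapping itself this is done by a Joiner transformation, not by code.

```python
def full_outer_join(source_rows, landing_rows, key="KEY"):
    """Pair source and landing rows by primary key; a missing side is None."""
    src = {row[key]: row for row in source_rows}
    lnd = {row[key]: row for row in landing_rows}
    # Every key that appears on either side produces exactly one pair.
    all_keys = sorted(set(src) | set(lnd))
    return [(src.get(k), lnd.get(k)) for k in all_keys]

source = [{"KEY": 1, "NAME": "Ann"}, {"KEY": 2, "NAME": "Bob"}]
landing = [{"KEY": 2, "NAME": "Bob"}, {"KEY": 3, "NAME": "Cid"}]

pairs = full_outer_join(source, landing)
for src_row, lnd_row in pairs:
    print(src_row, lnd_row)
```

Key 1 exists only in the source, key 3 only in landing, and key 2 on both sides. These are the three situations the ACTION_CODE expression has to distinguish.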

SQL Override
SELECT * FROM <Table Name> A
WHERE A.ACTION_CODE <> 'D'
AND A.ETL_CREATED_DATE = (SELECT MAX(lt1.ETL_CREATED_DATE) FROM <Table Name> lt1 WHERE A.Key = lt1.Key)

Here, we filter out the records whose ACTION_CODE is 'D' and, for each key, extract only the record with the latest ETL_CREATED_DATE.
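To see what this override returns, the query can be run against a small in-memory SQLite table from Python. This is only a sketch: the table name landing, the column names REC_KEY and NAME, and the sample rows are hypothetical stand-ins for the real landing table, and SQLite stands in for whatever database the mapping reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE landing ("
    "REC_KEY INTEGER, NAME TEXT, ACTION_CODE TEXT, ETL_CREATED_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO landing VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "I", "2024-01-01"),
        (1, "Anne", "U", "2024-01-02"),  # key 1 was updated later
        (2, "Bob", "I", "2024-01-01"),
        (2, "Bob", "D", "2024-01-03"),   # key 2 was deleted later
        (3, "Cid", "I", "2024-01-02"),
    ],
)

# Same shape as the SQL override: latest row per key, unless that row is a delete.
query = """
SELECT A.REC_KEY, A.NAME, A.ACTION_CODE
FROM landing A
WHERE A.ACTION_CODE <> 'D'
  AND A.ETL_CREATED_DATE = (
      SELECT MAX(lt1.ETL_CREATED_DATE) FROM landing lt1 WHERE A.REC_KEY = lt1.REC_KEY
  )
ORDER BY A.REC_KEY
"""
latest = conn.execute(query).fetchall()
print(latest)
```

Key 2 drops out entirely: its latest row is a delete, and its older insert row does not carry the maximum ETL_CREATED_DATE.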

Expression
exp_INITIAL:
The input fields need to be consistent before they are passed to the MD5 function. For example, if there are leading or trailing spaces in any of the fields, they need to be trimmed, because MD5 generates different values when leading or trailing spaces are present. MD5(John) and MD5( John) will therefore be different. So we need to apply the LTRIM and RTRIM functions to the columns before sending them to MD5.
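The effect of trimming can be checked with a short Python sketch. The helper md5_trimmed is hypothetical, and strip(" ") is used as a rough stand-in for applying LTRIM and RTRIM to spaces.

```python
import hashlib

def md5_trimmed(value):
    """Strip leading/trailing spaces, then return the MD5 hex digest."""
    cleaned = value.strip(" ")  # assumption: only spaces are trimmed
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()

# Without trimming, a single leading space changes the checksum.
raw_differs = hashlib.md5(b"John").hexdigest() != hashlib.md5(b" John").hexdigest()

# With trimming, both inputs hash to the same value.
trimmed_matches = md5_trimmed("John") == md5_trimmed(" John ")

print(raw_differs, trimmed_matches)
```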

ACTION_CODE:
IIF (NOT ISNULL(KEY_SRC) AND ISNULL(KEY_L), 'I',
  IIF (ISNULL(KEY_SRC) AND NOT ISNULL(KEY_L), 'D',
    IIF (v_ETL_MD5_CHECKSUM <> ETL_MD5_CHECKSUM_L AND KEY_L = KEY_SRC, 'U')))

KEY_SRC: Key from the source.
KEY_L: Key from the landing source.
ETL_MD5_CHECKSUM: Unique 32-character string generated from the input values from the source.
ETL_MD5_CHECKSUM_L: Column from the landing source which contains the unique 32-character string.
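The decision logic of the ACTION_CODE expression can be restated as a small Python function. This is a sketch, not the Informatica expression itself: the function name and argument names are hypothetical, None stands in for NULL, and returning an empty string for unchanged rows is an assumption, since the expression above has no final else branch.

```python
def action_code(key_src, key_l, md5_src, md5_l):
    """Classify one joined row as Insert, Delete, Update, or unchanged."""
    if key_src is not None and key_l is None:
        return "I"  # key exists only in the source
    if key_src is None and key_l is not None:
        return "D"  # key exists only in landing
    if key_src == key_l and md5_src != md5_l:
        return "U"  # same key, different checksum
    return ""       # assumption: unchanged rows get an empty code

print(action_code(1, None, "aaa", None))   # new row
print(action_code(None, 2, None, "bbb"))   # hard delete
print(action_code(3, 3, "ccc", "ddd"))     # changed row
print(repr(action_code(4, 4, "eee", "eee")))  # unchanged row
```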

ETL_MD5_CHECKSUM:
MD5 produces the same checksum when the values of two adjacent fields are swapped and one of them is blank, because the concatenated input string is identical in both cases. For example, say we have two non-key columns, Col1 and Col2. Suppose the Col1 value is Peter and Col2 is blank. MD5(Col1 || Col2) gives us some value. In the subsequent load, if we get blank in Col1 and Peter in Col2, then MD5(Col1 || Col2) gives us the same value as generated in the previous load, so it fails to capture the changed data. To resolve this, it is always safer to include an extra character (one that is not expected to come from the source) between two columns. For example, we can include a caret, as in MD5(Col1 || '^' || Col2 || '^' || Col3). With this approach, MD5 generates different values in the above scenario.

Syntax: MD5(Col1 || '^' || Col2 || '^' || Col3)
Note: If you have non-string columns such as SMALLINT, INT, NUMBER, etc., you need to convert them using the TO_CHAR function.
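The collision and its fix can be reproduced with a short Python sketch. The helper row_checksum is hypothetical; str() plays the role of TO_CHAR for non-string values, and treating None as an empty string is an assumption made only for this illustration.

```python
import hashlib

def row_checksum(*columns, delimiter="^"):
    """Join column values with a delimiter and return the MD5 hex digest."""
    # Assumption: None is treated as an empty string; numbers are converted with str().
    parts = ["" if c is None else str(c) for c in columns]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()

# Without a delimiter, "Peter" + "" and "" + "Peter" are the same string.
no_delim_a = row_checksum("Peter", "", delimiter="")
no_delim_b = row_checksum("", "Peter", delimiter="")

# With the caret, the inputs become "Peter^" and "^Peter".
with_delim_a = row_checksum("Peter", "")
with_delim_b = row_checksum("", "Peter")

print(no_delim_a == no_delim_b)      # collision
print(with_delim_a != with_delim_b)  # change is detected
```

The delimiter only helps if it cannot occur in the data itself, which is why the text recommends a character that is not expected to come from the source.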

ROUTER
In the router, we route the data based on the requirement. In this case, we send all inserts and updates to one target and all deletes to another target, based on the ACTION_CODE.
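The routing rule amounts to splitting rows into two groups by ACTION_CODE. A minimal Python sketch follows, with a hypothetical function name and sample rows; in the mapping this is done by Router transformation groups.

```python
def route(rows):
    """Split rows into (inserts and updates, deletes) based on ACTION_CODE."""
    upserts = [r for r in rows if r["ACTION_CODE"] in ("I", "U")]
    deletes = [r for r in rows if r["ACTION_CODE"] == "D"]
    return upserts, deletes

rows = [
    {"KEY": 1, "ACTION_CODE": "I"},
    {"KEY": 2, "ACTION_CODE": "D"},
    {"KEY": 3, "ACTION_CODE": "U"},
]
upserts, deletes = route(rows)
print([r["KEY"] for r in upserts], [r["KEY"] for r in deletes])
```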

Explanation
In the first run, the landing source is empty, so the landing keys are all NULL and all the records coming from the source are loaded into the target.

In the next run, assume we get a new record in the source. Its key in the source is not null and its key in landing is null, so the record is inserted into the landing target with ACTION_CODE=I.

In another run, suppose the non-key columns of an existing row are updated in the source. The MD5 of the existing data is compared with the MD5 generated from the new data. Because the MD5 values differ, the record is inserted into the landing target with ACTION_CODE=U.

Finally, if there are hard deletes in the source, the key in the source is null and the key in landing is not null, so the record is inserted into the landing target with ACTION_CODE=D.
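The sequence of runs described above can be simulated in a few lines of Python. This is a simplified sketch under stated assumptions: the landing table is modelled as an append-only list, the function and field names (run_load, current_landing, KEY, MD5, RUN) and the sample data are hypothetical, and the "latest non-deleted row per key" view imitates the SQL override.

```python
import hashlib

def checksum(values):
    """MD5 over the non-key values, joined with a caret."""
    return hashlib.md5("^".join(str(v) for v in values).encode("utf-8")).hexdigest()

def current_landing(landing):
    """Latest landing row per key, skipping keys whose latest row is a delete."""
    latest = {}
    for rec in landing:  # landing is kept in load order, so later rows win
        latest[rec["KEY"]] = rec
    return {k: r for k, r in latest.items() if r["ACTION_CODE"] != "D"}

def run_load(source, landing, run_id):
    """Compare source (key -> non-key values) with landing and append change rows."""
    current = current_landing(landing)
    changes = []
    for key in sorted(set(source) | set(current)):  # full outer join on the key
        if key not in current:
            code, md5 = "I", checksum(source[key])
        elif key not in source:
            code, md5 = "D", current[key]["MD5"]
        elif checksum(source[key]) != current[key]["MD5"]:
            code, md5 = "U", checksum(source[key])
        else:
            continue  # unchanged row: nothing is written
        changes.append({"KEY": key, "MD5": md5, "ACTION_CODE": code, "RUN": run_id})
    landing.extend(changes)
    return [(c["KEY"], c["ACTION_CODE"]) for c in changes]

landing = []
run1 = run_load({1: ("Ann", "NY"), 2: ("Bob", "LA")}, landing, 1)
run2 = run_load({1: ("Ann", "NY"), 2: ("Bob", "LA"), 3: ("Cid", "SF")}, landing, 2)
run3 = run_load({1: ("Ann", "TX"), 2: ("Bob", "LA"), 3: ("Cid", "SF")}, landing, 3)
run4 = run_load({1: ("Ann", "TX"), 3: ("Cid", "SF")}, landing, 4)
print(run1, run2, run3, run4)
```

Each run writes only the rows whose state changed: the initial load, one new key, one updated row, and one hard delete.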

Conclusion
The advantage of using the MD5 function is that it reduces overall ETL run time and also reduces cache memory usage, because only the fields that are strictly necessary are cached.

Hence, the changed data can be captured using the MD5 function.
