
Data Quality

A Syntel White Paper On

Change Data Capture (CDC)

December, 2007

Confidential 2007 Syntel, Inc

Change Data Capture

Table of Contents

Change Data Capture: An Introduction
Why CDC?
CDC classifications
CDC Implementation
CDC Tools
CDC Consideration
CDC Benefits to business
About Syntel


1 Change Data Capture: An Introduction

CDC is the process by which one system identifies the data that has changed
in another system since a previous point in time, in order to take an action
based on the changed data. The system that contains the changed data is the
source, and the system that identifies the changed data and performs an
action is the target. The source and the target may be the same physical
system or may be geographically separated, and they can be built on
heterogeneous technologies.
CDC can reduce the volume of extracted data to enable data integration
scalability and can speed up data integration cycles to enable fast-paced
business activities. With the rapid increase in data volumes and the pace of
business, CDC is an important enabler for data integration scalability and
right-time business.


2 Why CDC?
The primary problems in data integration, data synchronization, data
propagation, and ETL and data warehousing projects today are:
Exponential Data Growth
Ever-increasing volumes of data in operational databases put ever-increasing
loads on data integration processes.
Real-time operation
Although critical data could change in a database at any moment, the
majority of data integration processes run overnight, which limits how
quickly and frequently a business can act on the data.
Shrinking integration windows
In companies moving toward the around-the-clock operations of a global
business, traditional nightly windows for IT administration and integration
are shrinking or have closed.
CDC can overcome these problems by:
Reducing the amount of data extracted from operational databases
Nonselective methods like table dumps or storage snapshots extract too
much data, which puts a load on a downstream integration server where the
data is filtered to reveal the changes since the last extract. With CDC, the
filtering task is distributed upstream to the data source, which results in less
data to move. With less to integrate, networks and downstream servers for
data transformation or data quality carry a lighter load, thereby giving the
total data integration architecture more room to scale up.
Providing more frequent opportunities to act on data
Instead of waiting overnight for data integration processes to deliver
information to its consumers, CDC enables business processes to act on
new or changed data multiple times daily in support of time-sensitive
business practices like performance management, activity monitoring, or
customer service.
Replacing lengthy load windows
CDC extracts data incrementally as it appears or changes in a DBMS, thus
alleviating the need to extract data in bulk during integration windows that
are gradually disappearing.
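
To make the contrast concrete, the nonselective approach described above can be sketched as a downstream comparison of two full extracts. This is only an illustration, assuming each extract is held as a Python dictionary keyed by primary key; the function name diff_snapshots is hypothetical. Every row of both extracts has to be moved and compared just to find a handful of changes, which is the work CDC pushes upstream to the source.

```python
def diff_snapshots(previous, current):
    """Compare two full extracts (dicts keyed by primary key).

    Returns (inserts, updates, deletes). Both extracts must be moved to
    the integration server in full before any change can be detected.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes
```

With CDC, only the rows that end up in these three result sets would leave the source system in the first place.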


3 CDC classifications
CDC can be implemented in two ways, namely
1. Event driven Push approach
2. Interval driven Pull approach

Event driven or Push approach


Updates are applied in response to an event on the data source. Change
capture agents identify and send changes to the target system as soon as the
changes occur. This model enables customers to update their analytical
applications on-demand with the latest information.

Interval driven or pull approach


Updates are applied at regular intervals in response to requests from the
target. Requests might occur every five minutes or every five days. The size
of the interval is usually based on the volatility of the data and the latency
requirements of the application.
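
The difference between the two approaches can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real change capture agent: the Source and PullTarget classes, the sequence numbers, and the method names are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Change = Tuple[int, str, dict]  # (sequence number, action, row)


class Source:
    """Hypothetical in-memory source that remembers every change."""

    def __init__(self) -> None:
        self._log: List[Change] = []
        self._listeners: List[Callable[[Change], None]] = []

    def subscribe(self, listener: Callable[[Change], None]) -> None:
        """Push approach: register a target to be notified on each change."""
        self._listeners.append(listener)

    def write(self, action: str, row: dict) -> None:
        change = (len(self._log) + 1, action, row)
        self._log.append(change)
        for listener in self._listeners:
            listener(change)  # sent as soon as the change occurs

    def changes_since(self, last_seq: int) -> List[Change]:
        """Pull approach: the target asks for everything after last_seq."""
        return [c for c in self._log if c[0] > last_seq]


class PullTarget:
    """Target that requests changes at an interval of its own choosing."""

    def __init__(self, source: Source) -> None:
        self.source = source
        self.last_seq = 0
        self.applied: List[Change] = []

    def poll(self) -> int:
        batch = self.source.changes_since(self.last_seq)
        if batch:
            self.applied.extend(batch)
            self.last_seq = batch[-1][0]
        return len(batch)
```

A push target sees each change immediately through its listener, while a pull target sees nothing until poll() is called; how often poll() runs is the interval discussed above.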


4 CDC Implementation
CDC implementation can be approached in two ways, namely
1. Home-grown solution implementation - Involves coding
a. Source Timestamps - All RDBMS
b. Source Triggers - All RDBMS
c. Database Logs - Oracle 10g
2. Tool implementation - GUI based with minimal coding

Source Timestamps - Timestamps on rows


Source system tables have a placeholder for the last updated timestamp for
each row. Every time a row is added or updated, this column should be
updated with the current timestamp value.
Control tables, as part of the load metadata, are maintained by the ETL
system (refer to Table 1 for the structure of this table). Before every load,
regardless of the frequency, a row is inserted in the control table to
indicate that the process has started and is currently running. The following
fields are loaded: process_id, process_nm, process_start_tms,
record_start_tms, record_end_tms and status_flg.
If a particular process is running for the first time (this can be identified by
checking whether there are any records for this process name in the table),
the record_start_tms field is loaded with an early default date such as
01/01/1900. The record_end_tms field is loaded with the current timestamp.
The status_flg field is loaded with the value F. The process_id field is a
running sequence number, and the process_nm field is loaded with the name
of the process. The process_start_tms field is loaded with the current
timestamp. If the process is running for the nth time, the record_start_tms
field is loaded with the value stored in the record_end_tms field of the
record that this process loaded when it ran the (n-1)th time. This is the only
difference; everything else remains the same.

Once the process completes, the record that was inserted at the beginning of
the process is updated in two fields: process_end_tms and status_flg. The
process_end_tms field is loaded with the current timestamp. The status flag
is overwritten with the value P, indicating that the process has completed
successfully.
While the process is running, at any given point in time there will be only
one record in this table for that process name with status flag F and a null
process end timestamp. All mappings/jobs/graphs/procedures that run as
part of the process extract the records from the respective source tables
that have changed between the record_start_tms and the record_end_tms of
this process. This ensures that each load captures only the changed data
from the source systems.
Load control table structure:

Column Name        Column Type  Description
process_id         Number(30)   Sequence generated number uniquely identifying a record
process_nm         Varchar(30)  Name of the process
process_start_tms  Timestamp    The time the process started, current time
process_end_tms    Timestamp    The time the process ended, current time
record_start_tms   Timestamp    The time from which the source record changes are captured
record_end_tms     Timestamp    The time till which the source record changes are captured
status_flg         Char(1)      The process completion status (P, F)

Table 1
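
The control table logic described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a production ETL framework: the table name load_control, the source table customer with its last_updated_tms column, and the function names are hypothetical, timestamps are passed in as ISO-formatted strings so that they compare correctly as text, and the extraction window is treated as exclusive of record_start_tms and inclusive of record_end_tms so that a row is not captured twice by consecutive runs.

```python
import sqlite3

FIRST_RUN_START = "1900-01-01 00:00:00"  # early default date for the first run


def create_control_table(conn):
    """Create the load control table following the structure in Table 1."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS load_control (
            process_id        INTEGER PRIMARY KEY AUTOINCREMENT,
            process_nm        TEXT NOT NULL,
            process_start_tms TEXT NOT NULL,
            process_end_tms   TEXT,
            record_start_tms  TEXT NOT NULL,
            record_end_tms    TEXT NOT NULL,
            status_flg        TEXT NOT NULL
        )""")


def start_run(conn, process_nm, now):
    """Insert the control row for a new run; return (process_id, start, end)."""
    row = conn.execute(
        "SELECT record_end_tms FROM load_control "
        "WHERE process_nm = ? ORDER BY process_id DESC LIMIT 1",
        (process_nm,)).fetchone()
    # First run: start from the default date. nth run: start where the
    # (n-1)th run's capture window ended.
    record_start = row[0] if row else FIRST_RUN_START
    cur = conn.execute(
        "INSERT INTO load_control (process_nm, process_start_tms, "
        "record_start_tms, record_end_tms, status_flg) "
        "VALUES (?, ?, ?, ?, 'F')",
        (process_nm, now, record_start, now))
    return cur.lastrowid, record_start, now


def extract_changes(conn, record_start, record_end):
    """Return rows of the hypothetical customer table changed in the window."""
    return conn.execute(
        "SELECT id, name FROM customer "
        "WHERE last_updated_tms > ? AND last_updated_tms <= ? ORDER BY id",
        (record_start, record_end)).fetchall()


def finish_run(conn, process_id, now):
    """Mark the run as completed successfully."""
    conn.execute(
        "UPDATE load_control SET process_end_tms = ?, status_flg = 'P' "
        "WHERE process_id = ?", (now, process_id))
```

A load would call start_run, pass the returned window to extract_changes for each source table, and call finish_run at the end. The sketch leaves open how a failed run should be handled on the next start.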

Source Triggers
Triggers are created on each of the source tables from which data is
extracted. Triggers are created for after_insert, after_update and
after_delete scenarios.
For every source table, a new change table is created with just the primary
key columns and two additional columns, namely action_flg and target_cd.
The action_flg column will be of the datatype Char(1). The target_cd column
will be of the datatype Varchar(30).


The action_flg column will store one of the following values:
I - Insert record
U - Update record
D - Delete record

The target_cd column will store the name of the target system that is using
this change table. If more than one target system is interested in capturing
the changes occurring in one source table, this column identifies which
target each change record belongs to.
Each time an operation is performed on the source table, the triggers write
one or more records into the corresponding change table with the primary
key values. The number of records written per change depends on the number
of target systems that subscribe to change capture on the source table.
Not every target system needs to subscribe to all changes to the source
table: a target can subscribe to insert, update and delete changes, to a single
operation, or to a combination of operations. The target systems can also
subscribe to changes pertaining to a single column, a combination of
columns, or the whole row. The after_insert trigger loads I in the action_flg
column, the after_update trigger loads U, and the after_delete trigger loads
D. This flag identifies the change that was performed on the record.
Similarly, for each combination of source table and target system that
subscribes to change capture, a view is created that joins the source table
with the respective change table using the primary keys. The structure of the
view is similar to that of the source table. The only additional column in the
view is action_flg, which helps in classifying the changed record.

Once the target system has extracted or responded to the changes, it has to
purge those records from the change table (this should essentially be done
as soon as the view is formed) so that subsequent changes can be captured.
The target system should purge all records in the change table for which
that system is the subscriber.

Database logs - Asynchronous (supported only in Oracle 10g)


In this mode, the changes are captured from the database redo log files after
the changes have been committed to the source database. This mode of CDC
depends on the level of supplemental logging enabled at the source
database. Supplemental logging adds redo logging overhead at the source
database.
Before we get into the details we need to understand the following
terminologies which will be extensively used in the following sections.
Publisher - The person/system which captures and publishes the changed
data.
Subscriber - One/multiple application(s) or individual(s) that access the
changed data.
Source database - The production database that contains the data of
interest.
Change source - This is a logical representation of the source database.
Staging database - This is the database to which the captured change data
is applied.
Change table - This is a relational table into which change data for a single
source table is loaded. To subscribers, a change table is known as a
publication.
Change set - A set of change data that is guaranteed to be transactionally
consistent. This contains one or more change tables.

There are three ways in which asynchronous CDC can be achieved:
Asynchronous hotlog mode
Asynchronous distributed hotlog mode
Asynchronous autolog mode

Asynchronous hotlog mode


In the asynchronous hotlog mode, change data is captured from the online
redo log file on the source database. There is a brief latency between the act
of committing source database transactions and the arrival of the change
data.
There is a single, predefined hotlog change source, HOTLOG_SOURCE, that
represents the current online redo log files of the source database. This is
the only hotlog change source and cannot be altered or dropped. Change
tables for this mode of change data capture must reside locally in the source
database.


Figure 1 illustrates the asynchronous hotlog configuration. The log writer
process (LGWR) records committed transactions in the online redo log files
on the source database. CDC uses Oracle Streams to automatically populate
the change tables in the change sets within the HOTLOG_SOURCE change
source as newly committed transactions arrive.

Figure 1

Asynchronous distributed hotlog


In the asynchronous distributed hotlog mode, change data is captured from
the online redo log file on the source database. There is no predefined
distributed hotlog change source. Unlike other modes of CDC, this mode
splits CDC activities and objects across the source and staging databases.
Change sources are defined on the source database by the staging database
publisher.
A distributed hotlog change source represents the current online redo log
files of the source database. However, staging database publishers can
define multiple distributed hotlog change sources, each of which contains
change sets on a different staging database. The source and staging
database can be on different hardware platforms and be running different
operating systems.


Figure 2 illustrates the asynchronous distributed hotlog configuration. The
change source on the source database captures change data from the online
redo log files and uses Streams to propagate it to the change set on the
staging database. The change set on the staging database populates the
change tables within the change set.
Two publishers are required for this mode of change data capture, one on
the source database and the other on the staging database. The source
database publisher defines a database link on the source database to
connect to the staging database as the staging database publisher. The
staging database publisher defines a database link on the staging database
to connect to the source database as the source database publisher. All
publishing operations are performed by the staging database publisher.

Figure 2

Asynchronous autolog mode


In this mode, change data is captured from a set of redo log files managed
by redo transport services. Redo transport services control the automated
transfer of redo log files from the source database to the staging database.
Using database initialization parameters, the publisher configures redo
transport services to copy the redo log files from the source database
system to the staging database system and to automatically register the redo
log files. Asynchronous autolog mode can obtain change data from either
the source database online redo log or the source database archived redo
logs. These options are called asynchronous autolog online and
asynchronous autolog archive.
With the AutoLog online option, redo transport services is set up to copy
redo data from the online redo log at the source database to the standby
redo log at the staging database. Change sets are populated after individual
source database transactions commit. There can be only one AutoLog online
change source on a given staging database, and it can contain only one
change set.
With the AutoLog archive option, redo transport services is set up to copy
archived redo logs from the source database to the staging database.
Change sets are populated as new archived redo log files arrive on the
staging database. The degree of latency depends on the frequency of redo
log file switches on the source database. The AutoLog archive option has a
higher degree of latency than the AutoLog online option, but there can be as
many AutoLog archive change sources as desired on a given staging
database.
There is no predefined AutoLog change source. The publisher provides
information about the source database to create an AutoLog change source.
Figure 3 shows a CDC asynchronous AutoLog online configuration in which
the LGWR process on the source database copies redo data to both the
online redo log file on the source database and to the standby redo log files
on the staging database as specified by the LOG_ARCHIVE_DEST_2
parameter. (Although the image presents this parameter as
LOG_ARCHIVE_DEST_2, the integer value can be any value between 1 and
10.)
Note that the LGWR process uses Oracle Net to send redo data over the
network to the remote file server (RFS) process. Transmitting redo data to a
remote destination requires uninterrupted connectivity through Oracle Net.
On the staging database, the RFS process writes the redo data to the
standby redo log files. Then, CDC uses Oracle Streams downstream capture
to populate the change tables in the change sets within the AutoLog change
source.
The source database and the staging database must be running on the same
hardware, operating system, and Oracle version.


Figure 3

Figure 4 shows a typical Change Data Capture asynchronous AutoLog


archive configuration in which, when the redo log file switches on the
source database, archiver processes archive the redo log file on the source
database to the destination specified by the LOG_ARCHIVE_DEST_1
parameter and copy the redo log file to the staging database as specified by
the LOG_ARCHIVE_DEST_2 parameter. (Although the image presents these
parameters as LOG_ARCHIVE_DEST_1 and LOG_ARCHIVE_DEST_2, the
integer value in these parameter strings can be any value between 1 and 10.)
Note that the archiver processes use Oracle Net to send redo data over the
network to the remote file server (RFS) process. Transmitting redo log files
to a remote destination requires uninterrupted connectivity through Oracle
Net.
On the staging database, the RFS process writes the redo data to the copied
log files. Then, Change Data Capture uses Oracle Streams downstream
capture to populate the change tables in the change sets within the AutoLog
change source.


Figure 4


5 CDC Tools
Popular CDC tools available in the market:

Tool Name                                Support for
Informatica PowerExchange                Netweaver, Salesforce, Adabas, Datacom, DB2, IDMS,
                                         IMS DB, SQL Server, Oracle, VSAM
Websphere DataStage Change Data Capture  SQL Server, Oracle, DB2, IMS
Attunity Stream                          SQL Server, Oracle, DB2, VSAM, IMS, DB2/400,
                                         Adabas, Enscribe, SQL/MP
Oracle asynchronous CDC                  Oracle 10g
Sybase ASE Real time data services       Sybase
Connx Datasync                           RMS, Oracle, DB2, Sybase, Rdb, DBMS, C-ISAM,
                                         Informix, Micro Focus, MySQL, SQL Server, VSAM,
                                         IMS, DataFlex, POWERFlex, Adabas
DataMirror                               Oracle, UDB, DB2, SQL Server
GoldenGate                               Oracle, SQL Server, DB2, UDB, Sybase, Enscribe,
                                         SQL/MP, SQL/MX, Teradata

Table 2


6 CDC Consideration
Points to be considered for a home-grown CDC solution implementation
1. Schema changes
Changes or additions to the source system tables are not always well
accepted by the existing system administrators. CDC by source
timestamps involves a schema change to add a new timestamp field.
CDC by triggers requires additional tables and views to be created in
order to capture the changed data.
2. Minimal overhead on existing application
When implemented, CDC should not degrade the existing system's
performance beyond a mutually acceptable SLA. CDC by timestamps
involves the additional operation of logging the timestamp in each
record. CDC by triggers involves the additional operation of inserting the
changed record's primary key fields into the change table for any action
performed on the source table. These marginally increase the time
taken by the existing application to perform the same task. When there
are more than a million record changes a day, this becomes a marginal
overhead.
3. Physical vs. Virtual logging
Logging can be done in two ways. One is physical logging, i.e. copying
each changed record to the log. The other is the virtual approach, which
maintains a list of pointers to changed records without copying their
contents. Physical logging increases the load on the source systems and
needs disk storage, but it is very fast for the target system to access and
apply. Virtual logging is much easier on the source system, but increases
the load on the target system when it accesses the change.
4. Latency
The extent to which the target systems can wait before reflecting the
changed data. Some systems need this in real time, some can accept a
latency of hours, and some can wait for a day.
5. Cost & Time
The budget and the timeframe within which CDC has to be implemented.
Tools available in the market are more expensive than the home-grown
solution approaches. The implementation time for the tools is much
shorter than for the home-grown solution approaches.
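
The trade-off in point 3 can be illustrated with a small Python sketch. It assumes the source is an in-memory dictionary keyed by primary key; the class names PhysicalLog and VirtualLog are hypothetical. The physical log pays the copying and storage cost at write time, while the virtual log defers the cost to the target, which must look up each changed record in the source when it reads the log.

```python
import copy


class PhysicalLog:
    """Physical logging: copy each changed record into the log."""

    def __init__(self):
        self.entries = []

    def record(self, key, row):
        # The source pays for the copy and for the storage it occupies.
        self.entries.append((key, copy.deepcopy(row)))

    def read(self, source):
        # The target reads the log directly; no lookup in the source.
        return list(self.entries)


class VirtualLog:
    """Virtual logging: keep only pointers (keys) to changed records."""

    def __init__(self):
        self.keys = []

    def record(self, key, row):
        # Cheap for the source: only the key is remembered, once.
        if key not in self.keys:
            self.keys.append(key)

    def read(self, source):
        # The target must fetch the current contents from the source.
        return [(key, source.get(key)) for key in self.keys]
```

One consequence visible in this sketch is that the physical log preserves every intermediate version of a record, whereas the virtual log only ever yields the record's current contents.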


Key features to be considered for a CDC tool implementation


1. Non Intrusive
2. Low operational overhead
3. Heterogeneous environment (multiple source/target environments)
4. Reliable (guaranteed delivery, fail over and recoverability)
5. Performance and throughput
6. Ease of use
7. High volumes
8. Batch & Near real time (right time) delivery
9. Integration with ETL/EAI tools
10. Metadata Management

Parties involved in the CDC decision

Business users of the downstream system

Operational users of the upstream system

Administrator (Database) of the upstream system

Design lead of the upstream system


7 CDC Benefits to business


The benefits that a business can derive from implementing CDC are
described below.

Less extracted data means more data integration scalability

As data volumes swell and integration windows shrink, CDC's ability to
reduce the amount of extracted data (as well as to produce data that
requires less downstream processing) will be increasingly important to
achieving ever-increasing scalability for data integration processes.

Capturing change as it occurs means businesses can react sooner


As the pace of business continues to accelerate, CDC becomes yet
another option for recognizing time-sensitive events (like a customer
interaction, inventory outage, or shortfall in sales), so managers can
react quickly to correct a problem or leverage an opportunity.


About Syntel
Syntel is a global Applications Outsourcing and e-Business company that
delivers real-world technology solutions to Global 2000 corporations.
Founded in 1980, Syntel's portfolio of services includes complex application
development, management, product engineering, and enterprise
application integration services, as well as e-Business development and
integration, wireless solutions, data warehousing, CRM, and ERP. We
maximize outsourcing investments through an onsite/offshore Global
Delivery Service, increasing the efficiency of how complex IT projects are
delivered. Syntel's global approach also makes a significant and positive
impact on speed-to-market, budgets, and quality. We deploy a custom
delivery model that is a seamless extension of your IT organization to fit
your business goals and a proprietary knowledge transfer methodology to
guarantee knowledge continuity.
SYNTEL INC.
525 E. Big Beaver, Third Floor
Troy, MI 48083
Phone: 248.619.3503
info@syntelinc.com

2007 Syntel Limited


ALL RIGHTS RESERVED
Copyright in the whole and part of this Change Data Capture paper belongs
to Syntel Limited. This work may not be used, sold, transferred, adapted,
abridged, copied or reproduced in whole or in part in any manner or form or
in any media without the prior written consent of Syntel.

