December, 2007
Table of Contents
1 Change Data Capture: An Introduction
2 Why CDC?
3 CDC classifications
4 CDC Implementation
5 CDC Tools
6 CDC Consideration
About Syntel
1 Change Data Capture: An Introduction
2 Why CDC?
The primary problems in data integration, data synchronization, data propagation, and ETL & data warehousing projects today are:
Exponential Data Growth
Ever-increasing volumes of data in operational databases put ever-increasing loads on data integration processes.
Real-time operation
Although critical data could change in a database at any moment, the
majority of data integration processes run overnight, which limits how
quickly and frequently a business can act on the data.
Shrinking integration windows
In companies moving toward the around-the-clock operations of a global
business, traditional nightly windows for IT administration and integration
are shrinking or have closed.
CDC can overcome these problems by:
Reducing the amount of data extracted from operational databases
Nonselective methods like table dumps or storage snapshots extract too
much data, which puts a load on a downstream integration server where the
data is filtered to reveal the changes since the last extract. With CDC, the
filtering task is distributed upstream to the data source, which results in less
data to move. With less to integrate, networks and downstream servers for
data transformation or data quality carry a lighter load, thereby giving the
total data integration architecture more room to scale up.
Providing more frequent opportunities to act on data
Instead of waiting overnight for data integration processes to deliver
information to its consumers, CDC enables business processes to act on
new or changed data multiple times daily in support of time-sensitive
business practices like performance management, activity monitoring, or
customer service.
Replacing lengthy load windows
CDC extracts data incrementally as it appears or changes in a DBMS, thus
alleviating the need to extract data in bulk during integration windows that
are gradually disappearing.
3 CDC classifications
CDC can be implemented in two ways, namely
1. Event-driven (push) approach, in which changes are propagated to the target systems as they occur in the source
2. Interval-driven (pull) approach, in which the target systems poll the source for changes at scheduled intervals
4 CDC Implementation
CDC implementation can be approached in two ways, namely
1. Home-grown solution implementation (involves coding)
   a. Source Timestamps (all RDBMS)
   b. Source Triggers (all RDBMS)
   c. Database Logs (Oracle 10g)
2. Tool-based implementation (see Section 5, CDC Tools)
Source Timestamps
At the start of each load, a record is inserted into a load control table for the process, with status_flg 'F' and a null process_end_tms; record_start_tms and record_end_tms define the extraction window for the run. Once the process completes, the record that was inserted at the beginning of the process is updated: the process_end_tms field is loaded with the current timestamp, and the status_flg is overwritten with the value 'P', indicating that the process has completed successfully. While the process is running, at any given point in time there will be only one record in this table for that process name with status flag 'F' and a null process end timestamp. All mappings/jobs/graphs/procedures that run as part of the process extract records from the respective source tables that have changed between the record_start_tms and the record_end_tms of this process. This way we ensure that each load captures only the changed data from the source systems.
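As a concrete sketch (assuming a source table CUSTOMER with a last_update_tms column, and a load control table named LOAD_CONTROL matching the structure shown below), one run of the process might look like this in Oracle SQL:

    -- 1. At process start: register the run with status 'F' and a null end timestamp.
    INSERT INTO load_control
      (process_id, process_nm, process_start_tms, process_end_tms,
       record_start_tms, record_end_tms, status_flg)
    VALUES
      (101, 'CUSTOMER_LOAD', SYSTIMESTAMP, NULL,
       TIMESTAMP '2007-12-01 00:00:00',  -- typically the previous run's record_end_tms
       SYSTIMESTAMP, 'F');

    -- 2. During the run: each mapping/job extracts only the rows that changed
    --    within this run's extraction window.
    SELECT src.*
      FROM customer src, load_control lc
     WHERE lc.process_nm = 'CUSTOMER_LOAD'
       AND lc.status_flg = 'F'
       AND src.last_update_tms >= lc.record_start_tms
       AND src.last_update_tms <  lc.record_end_tms;

    -- 3. On successful completion: stamp the end time and mark the run processed.
    UPDATE load_control
       SET process_end_tms = SYSTIMESTAMP,
           status_flg      = 'P'
     WHERE process_nm = 'CUSTOMER_LOAD'
       AND status_flg = 'F';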
Load control table structure:

Column Name          Column Type    Description
process_id           Number(30)     Unique identifier for the load run
process_nm           Varchar(30)    Name of the load process
process_start_tms    Timestamp      Time the process started
process_end_tms      Timestamp      Time the process completed (null while running)
record_start_tms     Timestamp      Lower bound of the extraction window
record_end_tms       Timestamp      Upper bound of the extraction window
status_flg           Char(1)        'F' while running, 'P' on successful completion

Table 1
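In Oracle DDL, the table above might be created as follows (a sketch; the NOT NULL constraints and the primary key are assumptions consistent with how the table is used):

    CREATE TABLE load_control (
      process_id         NUMBER(30)    NOT NULL,  -- surrogate key for the load run
      process_nm         VARCHAR2(30)  NOT NULL,  -- logical process name
      process_start_tms  TIMESTAMP     NOT NULL,  -- when the run began
      process_end_tms    TIMESTAMP,               -- null while the run is in flight
      record_start_tms   TIMESTAMP     NOT NULL,  -- lower bound of extraction window
      record_end_tms     TIMESTAMP     NOT NULL,  -- upper bound of extraction window
      status_flg         CHAR(1)       NOT NULL,  -- 'F' = running, 'P' = processed
      CONSTRAINT load_control_pk PRIMARY KEY (process_id)
    );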
Source Triggers
Triggers are created on each of the source tables from which data is
extracted. Triggers are created for after_insert, after_update and
after_delete scenarios.
For every source table a new change table is created with just the primary
key columns and two additional columns, namely action_flg and target_cd.
The action_flg column will be of the datatype Char(1) and will hold one of the following values:
   I   Insert record
   U   Update record
   D   Delete record
The target_cd column will be of the datatype Varchar(30).
The target_cd column will store the name of the target system that uses the change table. If more than one target system is interested in capturing the changes occurring in a source table, this column identifies which change record belongs to which system.
Each time an operation is performed on the source table, the triggers will write one or more records into the corresponding change table with the primary key values. The number of records written per change is based on the number of target systems that subscribe to change capture on the source table. Not every change to the source table will be subscribed to by every target system: a system can subscribe to insert, update, and delete changes, to just one of these operations, or to a combination of them. The target systems can also subscribe to changes pertaining to a single column, a combination of columns, or the whole row. The after_insert trigger will load the value 'I' in the action_flg column, while the after_update trigger will load the value 'U' and the after_delete trigger will load the value 'D'. This flag helps in identifying the change that was performed on the record.
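As an illustration, an after-insert trigger for a hypothetical ORDERS source table (primary key ORDER_ID, change table ORDERS_CHG, two subscribing targets) might look like the sketch below; the after-update and after-delete triggers are analogous, writing 'U' (from :NEW) and 'D' (from :OLD) respectively:

    CREATE OR REPLACE TRIGGER orders_after_insert
    AFTER INSERT ON orders
    FOR EACH ROW
    BEGIN
      -- One change record per target system subscribed to inserts on ORDERS
      INSERT INTO orders_chg (order_id, action_flg, target_cd)
      VALUES (:NEW.order_id, 'I', 'DWH');   -- data warehouse subscriber
      INSERT INTO orders_chg (order_id, action_flg, target_cd)
      VALUES (:NEW.order_id, 'I', 'CRM');   -- CRM subscriber
    END;
    /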
Similarly, for each source table and target system combination that subscribes to change capture, a view will be created that joins the source table with the respective change table using the primary keys. The structure of the view will be similar to that of the source table; the only additional column in the view will be action_flg, which helps in classifying the changed record.
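For the same hypothetical ORDERS table and its DWH subscriber, the view might be sketched as below. Note that an inner join surfaces inserted and updated rows; delete records would need separate handling, since the deleted row no longer exists in the source table:

    CREATE OR REPLACE VIEW orders_chg_dwh_v AS
    SELECT src.*,           -- same structure as the source table
           chg.action_flg   -- plus the flag classifying the change
      FROM orders src, orders_chg chg
     WHERE chg.order_id  = src.order_id
       AND chg.target_cd = 'DWH';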
Once the target system has extracted and responded to the change, it has to purge those records from the change table (essentially as soon as the view has been consumed) in order to allow subsequent change capture. The target system should purge all records in the change table for which it is the subscriber.
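The purge itself reduces to a delete scoped to the subscriber's own records, for example:

    -- Remove only this subscriber's processed change records
    DELETE FROM orders_chg
     WHERE target_cd = 'DWH';
    COMMIT;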
Database Logs
There are three ways by which asynchronous CDC can be achieved:
1. Asynchronous HotLog mode
2. Asynchronous Distributed HotLog mode
3. Asynchronous AutoLog mode
Figure 1: Asynchronous HotLog configuration
Figure 2: Asynchronous Distributed HotLog configuration
Asynchronous AutoLog mode uses redo transport services to copy redo log files from the source database system to the staging database system and to automatically register the redo log files. Asynchronous AutoLog mode can obtain change data from either the source database online redo log or from the source database archived redo logs.
These options are called asynchronous AutoLog online and asynchronous AutoLog archive. With the AutoLog online option, redo transport services are configured to copy redo data from the online redo log at the source database to the standby redo log at the staging database. Change sets are populated after individual source database transactions commit. There can be only one AutoLog online change source on a given staging database, and it can contain only one change set. With the AutoLog archive option, redo transport services are configured to copy archived redo logs from the source database to the staging database. Change sets are populated as new archived redo log files arrive on the staging database. The degree of latency depends on the frequency of redo log file switches on the source database. The AutoLog archive option has a higher degree of latency than the AutoLog online option, but there can be as many AutoLog archive change sources as desired on a given staging database.
There is no predefined AutoLog change source. The publisher provides
information about the source database to create an AutoLog change source.
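As a sketch, the publisher would use Oracle's DBMS_CDC_PUBLISH package for this step; the names below are placeholders and the parameter list is abbreviated (exact signatures vary by Oracle release):

    BEGIN
      DBMS_CDC_PUBLISH.CREATE_AUTOLOG_CHANGE_SOURCE(
        change_source_name => 'SRC_AUTOLOG',           -- placeholder name
        description        => 'AutoLog change source',
        source_database    => 'SRCDB',                 -- global name of the source database
        first_scn          => 1234567);                -- SCN from which capture should begin
    END;
    /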
Figure 3 shows a CDC asynchronous AutoLog online configuration in which the LGWR process on the source database copies redo data both to the online redo log file on the source database and to the standby redo log files on the staging database, as specified by the LOG_ARCHIVE_DEST_2 parameter. (Although the figure presents this parameter as LOG_ARCHIVE_DEST_2, the integer suffix can be any value between 1 and 10.)
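A representative setting on the source database might look like the lines below, assuming the staging database is reachable through an Oracle Net service named STAGINGDB (the exact attribute list varies by release):

    # init.ora on the source database: ship online redo to the staging database
    LOG_ARCHIVE_DEST_2='SERVICE=stagingdb LGWR ASYNC OPTIONAL NOREGISTER VALID_FOR=(ONLINE_LOGFILE,PRIMARY_ROLE)'
    LOG_ARCHIVE_DEST_STATE_2=ENABLE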
Note that the LGWR process uses Oracle Net to send redo data over the
network to the remote file server (RFS) process. Transmitting redo data to a
remote destination requires uninterrupted connectivity through Oracle Net.
On the staging database, the RFS process writes the redo data to the
standby redo log files. Then, CDC uses Oracle Streams downstream capture
to populate the change tables in the change sets within the AutoLog change
source.
The source database and the staging database must be running on the same
hardware, operating system, and Oracle version.
Figure 3: Asynchronous AutoLog online configuration
Figure 4: Asynchronous AutoLog archive configuration
5 CDC Tools
Popular CDC tools available in the market:

Tool Name                    Support for
Informatica PowerExchange
DataMirror
GoldenGate

Table 2
6 CDC Consideration
Points to be considered for a CDC home-grown solution implementation:
1. Schema changes
Changes to or additions of source system tables are not always well accepted by the existing system administrators. CDC by source timestamps involves a schema change to add a new timestamp field. CDC by triggers requires additional tables and views to be created in order to capture the changed data.
2. Minimal overhead on existing application
When implemented, CDC should not degrade the existing system's performance beyond a mutually acceptable SLA. CDC by timestamps involves the additional operation of logging the timestamp in each record. CDC by triggers involves the additional operation of inserting the changed record's primary key fields into the change table for any action performed on the source table. Each of these marginally increases the time taken by the existing application to perform the same task; at volumes of more than a million record changes a day, that marginal cost accumulates into a noticeable overhead.
3. Physical vs. Virtual logging
Logging can be done in two ways. One is physically logging changes, i.e., copying each changed record to the log. The other is a virtual approach, which maintains a list of pointers to changed records without copying their contents. Physical logging increases the load on the source system and needs disk storage, but it is very fast for the target system to access and apply. Virtual logging is much easier on the source system, but it increases the load on the target system to access the changes.
4. Latency
Latency is the extent to which the target systems can wait before reflecting the changed data. Some systems require this to be real time, some can accept a latency of hours, and some can wait for a day.
5. Cost & Time
The budget and the timeframe within which the CDC has to be implemented. Tools available in the market are more expensive than the home-grown solution approaches, but their implementation time is much shorter.
About Syntel
Syntel is a global Applications Outsourcing and e-Business company that
delivers real-world technology solutions to Global 2000 corporations.
Founded in 1980, Syntel's portfolio of services includes complex application
development, management, product engineering, and enterprise
application integration services, as well as e-Business development and
integration, wireless solutions, data warehousing, CRM, and ERP. We
maximize outsourcing investments through an onsite/offshore Global
Delivery Service, increasing the efficiency of how complex IT projects are
delivered. Syntel's global approach also makes a significant and positive
impact on speed-to-market, budgets, and quality. We deploy a custom
delivery model that is a seamless extension of your IT organization to fit
your business goals and a proprietary knowledge transfer methodology to
guarantee knowledge continuity.
SYNTEL INC.
525 E. Big Beaver, Third Floor
Troy, MI 48083
Phone: 248.619.3503
info@syntelinc.com