
Surrogate Key State Files in DataStage 8
Vincent McBurney | Jul 4, 2007 | Comments (8)

DataStage 8 brings some surrogate key sugar to parallel jobs with the new State Files feature. Now not only can you generate
keys but you can keep track of them between batches.
Surrogate keys are a great way to start an argument on a data warehouse project. Do we need them? How do we create them? Do
they need to be in sequence? Is anyone going to challenge me to a knife fight?
DataStage 8 helps manage the generation of surrogate keys via State Files. This is very similar to what DataStage Server Edition
people have been using for years with the SDK key generation routines that keep surrogate keys in Universe files.
Surrogate Keys for DataStage Server Edition
Some of you might not know this but if you go poking through the DataStage repository window under Routines you will find
some surrogate key management routines:

The routines help generate a key value and save the last value used in a file - just by calling one of the routines from a
Transformer stage.
This use of routines is old-school DataStage. You can find plenty of detail about these routines on DSXchange in threads like "Handling KeyMgtGetNextValue() while moving machines - Post2"; it's covered so well on the forum I won't go into it here.
Surrogate Keys for DataStage 7
I've always thought that the fastest way to generate surrogate keys in a parallel job is some type of parallel counter from the Surrogate Key stage or the Transformer. These are very easy to do inside the job. The tricky part is making sure the numbers are unique across different jobs and different batch runs. It's a hurdle that leads a lot of projects to go for the more familiar database sequence for surrogates.
- ETL approach: pass the last key used into the job as a job parameter and increment it in the Surrogate Key stage or the Transformer stage (a minimal sketch of this approach follows this list). The tricky part is getting the latest key and making sure no other job or process is trying to generate the same key.
- Database approach: bulk load natural keys into a database and into a key mapping table. Map to existing keys or generate new keys via a database sequence or identity field, then pull the data back into DataStage via a lookup or join. The tricky part is managing the extra steps and the elapsed time lost waiting for them to happen.
- Sparse lookup approach: use a Lookup stage to retrieve new surrogate key values from a database via a sparse lookup. A simple design that is robust in terms of key management and can use a database sequence, but slow due to the database round trips.
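As an illustration of the ETL approach, here is a minimal sketch of a parallel Transformer derivation for the new key column, assuming the last key from the previous run is passed in as a job parameter named LastKeyUsed (the parameter and column names are assumptions, not from the original article):

* Output column derivation: unique across partitions because @PARTITIONNUM is zero-based
* and @INROWNUM numbers the rows within each partition.
LastKeyUsed + ((@INROWNUM - 1) * @NUMPARTITIONS) + @PARTITIONNUM + 1

The keys are unique whatever the row distribution, although uneven partitions will leave gaps in the sequence.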
DataStage 8 - Enter the State File
State Files have been added to the DataStage 8 Surrogate Key stage, building the convenience of the key management routine straight into the stage. The Surrogate Key stage is a fancy name for a counter: in DataStage 7 it let you provide a starting value and incremented it for each row, even when running in a parallel job across many instances. In DataStage 8 it gets an upgrade - it can keep track of previous values all by itself.
Creating the State File
As far as I can tell you can create a state file just by creating an empty file with the right name and then putting that name in the
Surrogate Key stage. I prefer a job parameter for this path and file name.
You can also create a state file using a DataStage job. This is easy but kind of odd: you create a job with nothing but a Surrogate Key stage in it, set it to an action of Create, and run the job. It creates the file.
Generating Keys
The Surrogate Key stage has the big advantage of opening the state file, retrieving the last key used, generating new keys and writing the last key used back to the state file - all with no effort or code from the user. The Surrogate Key stage creates a new output column with an incrementing key on it. This key will be unique within the job, even for a parallel job with multiple instances of the stage running at once.

It writes back numbers to the state file as it goes - so you have some minor I/O traffic with this approach. The file barely gets up
to 1K and often sits at 0K so there is very little data in it.
Avoiding Duplicate Numbers
Whenever I've proposed an ETL-only approach to surrogate keys I've been beaten down by two arguments:
Question: What happens when several jobs are generating surrogate keys for the same target table at the same time?
It¶s quite possible that you have several jobs trying to generate surrogate keys for the same target table. If you were sure that they
never run at the same time you can configure them to use the same State File and they will always have unique keys. If you
cannot guarantee this you can still have unique keys.
1. Give each job a unique number as a text field of two characters, e.g. 10 for JobA and 11 for JobB.
2. Give each job its own State File so they all start at number 1 and increment by 1 for each row. You can put the job code into the State File name, using a file name like #StateFilePath#/CustomerSKey#JobCode#.txt
3. Append the job identifier to the surrogate key generated in a Transformer stage. This turns the sequence 1, 2, 3 into 110, 210, 310 or 111, 211, 311. Starting the identifier at 10 or 100 or 1000 instead of 01 or 001 or 0001 prevents leading zeros from being lost in the process. In a Transformer use the ":" concatenation operator to join the two fields (see the sketch after the next paragraph).
4. All rows from JobA will always end in 10 and all rows from JobB will end in 11, so they will never have conflicting surrogate key values.
A two digit code gives you up to 99 jobs writing to the same target table. A three digit job identifier gives you 999 codes. If you want unique job codes across all jobs you could choose 5 characters, giving up to 99,999 jobs that will never have conflicting surrogate key values.
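Here is a minimal sketch of the Transformer derivation for step 3 above, assuming the generated counter arrives on an input column called GeneratedKey and the job code is passed in as a job parameter called JobCode (both names are assumptions):

* Output column derivation: append the two-character job code to the counter,
* so key 1 in the job with code 10 becomes 110, key 2 becomes 210, and so on.
InLink.GeneratedKey : JobCode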
Question: What if some other tool or process loads data into the target table and generates its own surrogate keys?
Answer: Tell them to take a flying jump in the lake. Seriously. You shouldn't be implementing a second-rate approach on the hypothetical possibility that someone will use a worse product to load data. Use jobs to resync the State File values if any other job alters the last key used.
The Dog Ate my State File
You should be backing up your state file directory - you can put all your state files in one place - but if the dog eats them you will need to recreate them. This creates a bit of a mess, because you cannot just look inside state files. A state file has a specific format that you cannot modify; if you tried to look at a State File in Notepad you might see something meaningless like this:
Qà[ pã[ ¡Û[ Öß[
To rebuild your state files with the right numbers you will need a generic job that retrieves the last key used from a database and writes it to a new state file:
You can use a user-defined database SQL statement that uses job parameters to build the retrieve statement:
SELECT MAX(#SKEY_FIELD#) FROM #SKEY_TABLE#
You can then pass this job the surrogate key field, table and target state file name. This job can be used to update the key for any
state file. This is also a way to synchronise your state file if some other way is used to load data into the target table.
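If the resync has to run for several state files, a controlling job control routine can call the generic job once per file. Here is a minimal DataStage BASIC sketch, assuming the generic job is called ResyncStateFile and takes a STATE_FILE parameter alongside SKEY_FIELD and SKEY_TABLE (the job name and the parameter values are assumptions):

* Attach the generic resync job, set its parameters, run it and wait for it to finish.
hJob = DSAttachJob("ResyncStateFile", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "SKEY_FIELD", "CUSTOMER_SKEY")
ErrCode = DSSetParam(hJob, "SKEY_TABLE", "CUSTOMER_DIM")
ErrCode = DSSetParam(hJob, "STATE_FILE", "/etl/statefiles/CustomerSKey.txt")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
ErrCode = DSDetachJob(hJob)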
If you want to manually set the surrogate key value and cannot pull it from a database, you can use a Row Generator to send a value into the Surrogate Key stage with the "Create and Update" option turned on, passing a job parameter into the state file as the last value used:

In this example the Row Generator does nothing except make sure a single row is generated for the job. The Transformer stage intercepts this row and writes the job parameter NEW_LAST_KEY to its output, which is passed into the Surrogate Key stage as the new key value for the state file.
If you make the state file name and the last key used both job parameters this job can be used to update any state file.


 
Surrogate Key Generation in a DataStage Parallel Job
Joshy George | Feb 14, 2008 | Comments (6)

An elegant and fast way to generate surrogate keys in a parallel job!

This is a hot topic discussed and attempted by most ETL architects, designers and developers. This article looks at an elegant way to generate surrogate keys in a DataStage parallel job, without the overhead of creating multiple jobs or maintaining state files. It leans slightly towards advanced or power users, as it involves creating a parallel routine using the DataStage Development Kit (Job Control Interfaces). But the strategy is simple and elegant, you can do it in one job, and the surrogate key is maintained in a centralised and editable location - an environment variable defined in Administrator. That also lets you use it across the project in different jobs.

The approach

1) Starting Key Value / The Last Key Used is an Environment Variable which is defined in Administrator.

2) Increment the surrogate key in the Surrogate Key stage, the Transformer or the Column Generator, passing the starting value in as the defined Environment Variable.

3) To capture the last key generated, use a Tail stage with the properties "Number of Rows (Per Partition) = 1" and "All Partitions = True".
4) Capture the last record in a Transformer stage using "@PARTITIONNUM + 1 = @NUMPARTITIONS" as the constraint/filter.

5) Call a parallel routine which uses the C/C++ APIs, in particular DSSetEnvVar, to update the Environment Variable with the surrogate key value of the last record.
Ex: SetEnvParam(DSProjectName, DSJobName, '', Input_Link.LAST_KEY+1).

  

Ref. Parallel Job Advanced Developer's Guide - Chapter 7: DataStage Development Kit (Job Control Interfaces), which documents the API call that sets an environment variable.
Remember to add appropriate messages (Info/Warning) to the log from this routine.

What if several jobs generate surrogate keys for the same target table at the same time?


You can consider different strategies for this. Here is one from Vincent McBurney's blog:
It's quite possible that you have several jobs trying to generate surrogate keys for the same target table. If you were sure that they
never run at the same time then you don't have to do anything, they will always have unique keys. If you cannot guarantee this
you can still have unique keys.

Give each job a unique number as a text field of two characters. Eg. 10 for JobA and 11 for JobB.
Append the job identifier to the surrogate key generated in a Transformer Stage. This turns the sequence of 1, 2, 3 into 110, 210,
310 or 111, 211, 311. Starting the ID at 10 or 100 or 1000 instead of 01 or 001 or 0001 prevents leading zeros from being lost in
the process. In a Transformer use the concatenation ":" character to concatenate two fields.
All rows from JobA will always end in 10, all rows from JobB will end in 11. They will never have conflicting surrogate key
values.

A two digit code gives you up to 99 jobs writing to the same target table. A three digit job identifier gives you 999 codes. If you
want unique job codes across all jobs you could choose 5 characters so you can have up to 99999 that will never have conflicting
surrogate key values.

The most important strategy to adopt when jobs run at the same time and generate surrogate keys for the same target table is how each job updates the Environment Variable with the surrogate key value of its last record:
* Trim the last part (i.e. the unique job number) from the last surrogate key value and use the remainder to update the environment variable.
* Before updating the Environment Variable, check that its current value is not already greater than the value you are about to write; if it is, don't update. For example, if JobA finishes with last key 500 after JobB has already pushed the variable to 800, writing 500 back would cause later runs to reissue keys 501 to 800.

Want to generate the job number automatically as well, rather than hard-coding it per job? Here is the trick. Along with the surrogate key environment variable for a target table, define a running job number variable (also an environment variable) with an initial value of 10. Whenever a new job starts loading the target table it picks up the surrogate key environment variable along with "running job number + 1" and updates the running job number variable with the incremented value. In the Transformer, as noted above, concatenate "environment variable : running job number" to get the surrogate key. When the job finishes and updates the surrogate key environment variable, decrement the corresponding running job number by 1.

What if the environment variable gets out of step with the target table?

In this case you need to pick the max surrogate key value from the target table and apply the same strategy described above: pass it through a Transformer and update the environment variable before the start of the main job, using the same conditions.
For more on parallel routines: DataStage Parallel routines made really easy
DataStage Interview Questions and Answers
1. What is the flow of loading data into fact and dimensional tables?
A) Fact table - Table with Collection of Foreign Keys corresponding to the Primary
Keys in Dimensional table. Consists of fields with numeric values.
Dimension table - Table with Unique Primary Key.
Load - Data should be first loaded into dimensional table. Based on the primary key
values in dimensional table, the data should be loaded into Fact table.
2. What is the default cache size? How do you change the cache size if needed?
A. Default cache size is 256 MB. We can increase it by going into Datastage
Administrator and selecting the Tunable Tab and specify the cache size over there.
3. What are types of Hashed File?
A) Hashed File is classified broadly into 2 types.
a) Static - Sub divided into 17 types based on Primary Key Pattern.
b) Dynamic - sub divided into 2 types
i) Generic ii) Specific.
Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify a number of settings (although all of these have default values).
By default a hashed file is "Dynamic - Type Random 30 D".
4. What does a Config File in parallel extender consist of?
A) Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.
5. What is Modulus and Splitting in Dynamic Hashed File?
A. A dynamic hashed file resizes itself as data is added or removed. The modulus is the current number of groups in the file; as the file grows, groups are split ("splitting") and the modulus increases, and as it shrinks, groups are merged and the modulus decreases.
6. What are Stage Variables, Derivations and Constraints?
A. Stage Variable - an intermediate processing variable that retains its value during the read and does not pass the value to the target column.
Derivation - an expression that specifies the value to be passed on to the target column.
Constraint - a condition that is either true or false and that controls the flow of data down a link.
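For illustration, here is a minimal sketch of how the three appear together in a Transformer (the link and column names are assumptions):

* Stage variable svIsNewKey: compare the incoming key with the value remembered from the previous row.
svIsNewKey = (InLink.CUST_ID <> svPrevCustId)
* Stage variable svPrevCustId: remember the current key for the next row.
svPrevCustId = InLink.CUST_ID
* Constraint on the output link: only pass rows where the key changed.
svIsNewKey
* Derivation for an output column: tidy up the name before it reaches the target.
UpCase(Trim(InLink.CUST_NAME))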
7. Types of views in DataStage Director?
There are 3 types of views in DataStage Director:
a) Job View - dates the jobs were compiled.
b) Status View - the status of the job's last run.
c) Log View - warning messages, event messages and program-generated messages.
8. Types of Parallel Processing?
A) Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massively Parallel Processing.
9. Orchestrate Vs Datastage Parallel Extender?
A) Orchestrate was itself an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then acquired Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.
10. Importance of Surrogate Key in Data warehousing?
A) A surrogate key is the primary key for a dimension table. Its main advantage is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source database.
11. How to run a Shell Script within the scope of a Data stage job?
A) By using the "ExecSH" routine in the before/after job properties.
12. How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to
yyyy-dd-mm?
A) We use a) "Iconv" function - Internal Conversion.
b) "Oconv" function - External Conversion.
The expression to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
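As a further hedged sketch of the same pattern, converting to yyyy-mm-dd instead (the column name is an assumption):

* Transformer derivation: parse mm/dd/yyyy text and reformat it as yyyy-mm-dd.
Oconv(Iconv(InLink.OrderDate, "D/MDY[2,2,4]"), "D-YMD[4,2,2]")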
13. How do you execute a DataStage job from the command line?
A) Using "dsjob" command as follows.
dsjob -run -jobstatus projectname jobname
14. Functionality of Link Partitioner and Link Collector?
Link Partitioner: It actually splits data into various partitions or data flows using
various partition methods.
Link Collector: It collects the data coming from partitions, merges it into a single data
flow and loads to target.
15. Types of Dimensional Modeling?
A) Dimensional modeling is again sub divided into 2 types.
a) Star Schema - Simple & Much Faster. Denormalized form.
b) Snowflake Schema - Complex with more Granularity. More normalized form.
16. Differentiate Primary Key and Partition Key?
Primary Key is a combination of unique and not null. It can be a collection of key values
called a composite primary key. A Partition Key is just a part of the Primary Key. There are
several methods of partition like Hash, DB2, and Random etc. While using Hash partition
we specify the Partition Key.
17. Differentiate Database data and Data warehouse data?
A) Data in a database is:
a) Detailed or transactional
b) Both readable and writable
c) Current
Data in a data warehouse, by contrast, is summarised or aggregated, read-mostly (loaded in batch) and historical.
18. Containers Usage and Types?
Container is a collection of stages used for the purpose of Reusability.
There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project.
19. Compare and Contrast ODBC and Plug-In stages?
ODBC: a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.
Plug-In: a) Good Performance.
b) Database specific. (Only one database)
c) Cannot handle Stored Procedures.
20. Dimension Modelling types along with their significance
Data Modelling is Broadly classified into 2 types.
a) E-R Diagrams (Entity-Relationships).
b) Dimensional Modelling.
Q 21 What are the Ascential DataStage products and connectivity options?
Ans:
Ascential Products
Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions
Connectivity
Files
RDBMS
Real-time
PACKs
EDI
Other
Q 22 Explain Data Stage Architecture?
Data Stage contains two types of components:
Client Components
Server Components
Client Components:
- Data Stage Administrator
- Data Stage Manager
- Data Stage Designer
- Data Stage Director
Server Components:
- Data Stage Engine
- Meta Data Repository
- Package Installer
Data Stage Administrator:
Used to create the project.
Contains set of properties
We can set the buffer size (by default 128 MB)
We can increase the buffer size.
We can set the Environment Variables.
In Tunables we have In-process and Inter-process row buffering:
In-process - data is passed between stages within a single process and read sequentially.
Inter-process - stages run as separate processes and the data is read as it comes.
It just interfaces to metadata.
Data Stage Manager:
We can view and edit the Meta data Repository.
We can import table definitions.
We can export the Data stage components in .xml or .dsx format.
We can create routines and transforms
We can compile the multiple jobs.
Data Stage Designer:
We can create the jobs. We can compile the job. We can run the job. We can
declare stage variable in transform, we can call routines, transform, macros, functions.
We can write constraints.
Data Stage Director:
We can run the jobs.
We can schedule the jobs. (Schedule can be done daily, weekly, monthly, quarterly)
We can monitor the jobs.
We can release the jobs.
Q 23 What is Meta Data Repository?
Meta Data is a data about the data.
It also contains
- Query statistics
- ETL statistics
- Business subject area
- Source information
- Target information
- Source-to-target mapping information
Q 24 What is Data Stage Engine?
It is the DataStage server engine that runs in the background and executes the jobs; it is derived from the UniVerse database engine rather than being a Java engine.
Q 25 What is Dimensional Modeling?
Dimensional Modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access.
Q 26 What is Star Schema?
Star Schema is a de-normalized multi-dimensional model. It contains a centralized fact table surrounded by dimension tables.
Dimension Table: It contains a primary key and description about the fact table.
Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.
Q 27 What is surrogate Key?
It is typically a 4-byte integer that replaces the transaction/business/OLTP key in the dimension table. A 4-byte integer can hold up to about 2 billion records.
Q 28 Why we need surrogate key?
It is used for integrating data from multiple sources and works better than the natural key as the primary key: it helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.
Q 29 What is Snowflake schema?
It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.
Q 30 Explain the Types of Fact Tables?
Factless Fact: contains only foreign keys to the dimension tables, with no measures.
Additive Fact: measures can be added across any dimension.
Semi-Additive Fact: measures can be added across some dimensions but not others, e.g. account balances or inventory levels, which are not additive across time.
Non-Additive Fact: measures cannot be added across any dimension, e.g. percentages, discounts, averages.
Conformed Fact: a measure that is defined identically in two or more fact tables, so the facts can be compared across the dimensions using the same set of measures.
Q 31 Explain the Types of Dimension Tables?
Conformed Dimension: a dimension table connected to more than one fact table, where the granularity defined in the dimension table is common across those fact tables.
Junk Dimension: a dimension table that contains only flags and indicators.
Monster Dimension: a very large dimension, especially one that also changes rapidly.
Degenerate Dimension: a dimension key (such as an order or line-item number) stored in the fact table with no dimension table of its own; typical of line-item-oriented fact table designs.
Q 32 What are stage variables?
Stage variables are variables declared in a Transformer stage and used to store intermediate values. Stage variables are active at run time (because memory is allocated at run time).
Q 33 What is sequencer?
It sets the sequence of execution of server jobs.
Q 34 What are Active and Passive stages?
Active Stage: Active stage model the flow of data and provide mechanisms for
combining data streams, aggregating data and converting data from one data type to
another. Eg, Transformer, aggregator, sort, Row Merger etc.
Passive Stage: A Passive stage handles access to Database for the extraction or writing
of data. Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.
Q 35 What is ODS?
Operational Data Store is a staging area where data can be rolled back.
Q 36 What are Macros?
They are built from Data Stage functions and do not require arguments.
A number of macros are provided in the JOBCONTROL.H file to facilitate getting
information about the current job, and links and stages belonging to the current job.
These can be used in expressions (for example for use in Transformer stages), job control
routines, filenames and table names, and before/after subroutines.
These macros provide the functionality of using the DSGetProjectInfo, DSGetJobInfo,
DSGetStageInfo, and DSGetLinkInfo functions with the DSJ.ME token as the JobHandle
and can be used in all active stages and before/after subroutines. The macros provide the
functionality for all the possible InfoType arguments for the DSGet...Info functions. See
the Function call help topics for more details.
The available macros are:
DSHostName
DSProjectName
DSJobStatus
DSJobName
DSJobController
DSJobStartDate
DSJobStartTime
DSJobStartTimestamp
DSJobWaveNo
DSJobInvocations
DSJobInvocationId
DSStageName
DSStageLastErr
DSStageType
DSStageInRowNum
DSStageVarList
DSLinkRowCount
DSLinkLastErr
DSLinkName
Examples
To obtain the name of the current job:
MyName = DSJobName
To obtain the full current stage name:
MyName = DSJobName : "." : DSStageName
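As a further hedged sketch, the macros can be combined with the logging functions listed under question 41, for example in a before-job subroutine or job control routine (the "AuditMsg" label is an assumption):

* Write an informational message containing the job name and start timestamp.
Call DSLogInfo("Job " : DSJobName : " started at " : DSJobStartTimestamp, "AuditMsg")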
Q 37 What is KeyMgtGetNextValue?
It is a built-in transform that generates sequential numbers. Its input type is a literal string and its output type is a string.
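A hedged usage sketch in a server Transformer derivation (the sequence name "CustomerDim" is an assumption):

* Return the next sequential value for the named key sequence.
KeyMgtGetNextValue("CustomerDim")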
Q 38 What are stages?
The stages are either passive or active stages.
Passive stages handle access to databases for extracting or writing data.
Active stages model the flow of data and provide mechanisms for combining data
streams, aggregating data, and converting data from one data type to another.
Q 39 What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.
Q 40 What is container?
A container is a group of stages and links. Containers enable you to simplify and
modularize your server job designs by replacing complex areas of the diagram with a
single container stage. You can also use shared containers as a way of incorporating
server job functionality into parallel jobs.
DataStage provides two types of container:
- Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.
- Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container: server shared containers and parallel shared containers.
Q 41 What is a function? (Job Control - Examples of Transform Functions)
Functions take arguments and return a value.
- BASIC functions: a function performs mathematical or string manipulations on the arguments supplied to it and returns a value. Some functions have no arguments; most have one or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax:
FunctionName(argument, argument)
- DataStage BASIC functions: these functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.
To do this, use this function:
Specify the job you want to control - DSAttachJob
Set parameters for the job you want to control - DSSetParam
Set limits for the job you want to control - DSSetJobLimit
Request that a job is run - DSRunJob
Wait for a called job to finish - DSWaitForJob
Get the metadata details for the specified link - DSGetLinkMetaData
Get information about the current project - DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage - DSGetIPCStageProps
Get information about the controlled job or current job - DSGetJobInfo
Get information about the meta bag properties associated with the named job - DSGetJobMetaBag
Get information about a stage in the controlled job or current job - DSGetStageInfo
Get the names of the links attached to the specified stage - DSGetStageLinks
Get a list of stages of a particular type in a job - DSGetStagesOfType
Get information about the types of stage in a job - DSGetStageTypes
Get information about a link in a controlled job or current job - DSGetLinkInfo
Get information about a controlled job's parameters - DSGetParamInfo
Get the log event from the job log - DSGetLogEntry
Get a number of log events on the specified subject from the job log - DSGetLogSummary
Get the newest log event, of a specified type, from the job log - DSGetNewestLogId
Log an event to the job log of a different job - DSLogEvent
Stop a controlled job - DSStopJob
Return a job handle previously obtained from DSAttachJob - DSDetachJob
Log a fatal error message in a job's log file and abort the job - DSLogFatal
Log an information message in a job's log file - DSLogInfo
Put an info message in the job log of a job controlling the current job - DSLogToController
Log a warning message in a job's log file - DSLogWarn
Generate a string describing the complete status of a valid attached job - DSMakeJobReport
Insert arguments into the message template - DSMakeMsg
Ensure a job is in the correct state to be run or validated - DSPrepareJob
Interface to the system send mail facility - DSSendMail
Log a warning message to a job log file - DSTransformError
Convert a job control status or error code into an explanatory text message - DSTranslateCode
Suspend a job until a named file either exists or does not exist - DSWaitForFile
Check if a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space - DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine - DSExecute
Set a status message for a job to return as a termination message when it finishes - DSSetUserStatus
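To show how several of these fit together, here is a minimal job control routine sketch in DataStage BASIC; the job name LoadCustomerDim and its TargetTable parameter are assumptions, not part of the list above:

* Attach the job, set a parameter and a warning limit, run it, wait, then check how it finished.
hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "TargetTable", "CUSTOMER_DIM")
ErrCode = DSSetJobLimit(hJob, DSJ.LIMITWARN, 50)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
   Call DSLogWarn("LoadCustomerDim did not finish cleanly", "JobControl")
End
ErrCode = DSDetachJob(hJob)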
Q 42 What are Routines?
Routines are stored in the Routines branch of the Data Stage Repository, where you can create, view or edit them. The following programming components are classified as routines: Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE) functions and Web Service routines.
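As a hedged sketch of the first kind, the body of a transform function routine with a single argument Arg1 simply sets the reserved result variable Ans (the cleansing logic here is an invented example):

* Return the argument trimmed of surrounding spaces and upper-cased.
Ans = UpCase(Trim(Arg1))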
Q 43 What is data stage Transform?
Q 44 What is Meta Brokers?
Q 45 What is usage analysis?
Q 46 What is job sequencer?
Q 47 What are different activities in job sequencer?
Q 48 What are triggers in data Stages? (conditional, unconditional, otherwise)
Q 49 Have you generated job reports?
Q 50 What is a plug-in?
Q 51 Have you created any custom transforms? Explain. (Oconv)
How do you handle reject data in DataStage?
Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. A Reject link has to be defined for every Output link on which you wish to collect rejected data. Rejected data is typically bad data such as duplicate primary keys or null rows where data is expected.
What is the functionality of the Link Partitioner and Link Collector stages?
 Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.
What are Routines in DataStage and what are the different types of routines?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines
What is the difference between the IConv() and OConv() functions?
 IConv() - Converts a string to an internal storage format
OConv() - Converts an expression to an output format.
How do you connect DataStage to a DB2 database?
 Using DB2 ODBC drivers.
What is MetaStage?
MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed with the use of MetaStage.
Can QualityStage be integrated with DataStage?
QualityStage can be integrated with DataStage. In QualityStage we have stages such as Investigate, Match and Survivorship, which let us do the data quality related work; to integrate with DataStage we need the QualityStage plug-in to achieve the task.
What is the difference between Oracle 8i and Oracle 9i?
Oracle 8i does not support the SYSDATE pseudo-column but 9i does. In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).
How do you merge two files in DataStage?
Either use the copy command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
What is DataStage Designer used for?
 You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data
source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer
work area, and add links.
What is DataStage Administrator used for?
 The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language
Support (NLS) is enabled, install and manage maps and locales.
What is DataStage Director used for?
DataStage Director is used to run and validate the jobs. You can go to DataStage Director from DataStage Designer itself.
What is DataStage Manager used for?
The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.
What are the Data and Overflow files of a hashed file?
As the names themselves suggest, they are the data and overflow files of a hashed file. In general we use Type 30 dynamic hashed files. The data file has a default size of 2GB and the overflow file is used if the data exceeds that size.
What do we use hashed files for?
Used for lookups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.
What are the steps involved when a new dimension is added to an existing data warehouse?
 Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension.
Change fact table and DW population routines.