
DataStage Designer

What is Data warehouse ?

What is Operational Databases ?

Data Extraction ?

Data Aggregation ?

Data Transformation ?

Advantages of Data warehouse ?

DataStage ?

Client Component ?

Server Component ?

DataStage Jobs ?

DataStage NLS ?

Stages

Passive Stage ?
Active Stage ?

Server Job Stages

Database
File
Processing
Real Time
Containers

Parallel Job Stages

Databases
Development/Debug Stages
File Stages
Processing Stages
Real Time
Restructure
Other Stages

Links ?

Parallel Processing

Types of Parallelism

Plug in Stage?

Difference Between Lookup and Join:


What is Staging Variable?

What are Routines?

What are the Job parameters?

What are Stage Variables, Derivations and Constants?

Why is the fact table in normal form?

What are an Entity, Attribute and Relationship?

DataStage Designer
A data warehouse is a central integrated database containing data from all
the operational sources and archive systems in an organization. It contains
a copy of transaction data specifically structured for query analysis.
This database can be accessed by all users, ensuring that each group in an organization
is accessing valuable, stable data.
Operational databases are usually accessed by many concurrent users. The
data in the database changes quickly and often. It is very difficult to obtain
an accurate picture of the contents of the database at any one time.
Because operational databases are task oriented, for example, stock inventory
systems, they are likely to contain dirty data. The high throughput
of data into operational databases makes it difficult to trap mistakes or
incomplete entries. However, you can cleanse data before loading it into a
data warehouse, ensuring that you store only good complete records.
Data extraction is the process used to obtain data from operational sources, archives, and
external data sources.
Data aggregation summarizes detailed data before it is loaded; the summed (aggregated)
total is stored in the data warehouse. Because the number of records stored in the data
warehouse is greatly reduced, it is easier for the end user to browse and analyze the data.
Transformation is the process that converts data to a required definition and value.
Data is transformed using routines based on a transformation rule, for
example, product codes can be mapped to a common format using a transformation
rule that applies only to product codes.
After data has been transformed it can be loaded into the data warehouse
in a recognized and required format.
Capitalizes on the potential value of the organization's information
Improves the quality and accessibility of data
Combines valuable archive data with the latest data in operational sources
Increases the amount of information available to users

Reduces the requirement of users to access operational data


Reduces the strain on IT departments, as they can produce one database to serve all user groups
Allows new reports and studies to be introduced without disrupting operational systems
Promotes users to be self-sufficient

DataStage is a tool that handles the design and processing required to build a data warehouse. It is an ETL tool that:


Extracts data from any number or type of database.
Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data.
You can easily extend the functionality by defining your own transforms to use.
Loads the data warehouse.

It consists of a number of client components and server components.


DataStage server and parallel jobs are compiled and run on the DataStage server. The job will connect to databases on other machines as necessary,
extract data, process it, then write the data to the target data warehouse.

DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.

DataStage Designer -> A design interface used to create DataStage applications (known as jobs).
DataStage Director -> A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.
DataStage Manager -> A user interface used to view and edit the contents of the Repository.
DataStage Administrator -> A user interface used to perform administration tasks such as setting up DataStage users,
creating and moving projects, and setting up purging criteria.

Repository -> A central store that contains all the information required to build a data mart or data warehouse.
DataStage Server -> Runs executable jobs that extract, transform, and load data into a data warehouse.
DataStage Package Installer -> A user interface used to install packaged DataStage jobs and plug-ins.
Basic types of DataStage jobs:
Server Jobs ->These are compiled and run on the DataStage server.
A server job will connect to databases on other machines as necessary,
extract data, process it, then write the data to the target data
warehouse.
Parallel Jobs -> These are compiled and run on the DataStage server
in a similar way to server jobs, but support parallel processing on
SMP, MPP, and cluster systems.

MainFrame Jobs -> These are available only if you have Enterprise
MVS Edition installed. A mainframe job is compiled and run on
the mainframe. Data extracted by such jobs is then loaded into the
data warehouse.
Shared Containers -> These are reusable job elements. They typically
comprise a number of stages and links. Copies of shared containers
can be used in any number of server jobs or parallel jobs and edited
as required.
Job Sequences -> A job sequence allows you to specify a sequence of
DataStage jobs to be executed, and actions to take depending on
results.
Built in Stages -> Supplied with DataStage and used for extracting,
aggregating, transforming, or writing data. All types of job have
these stages.
Plug in Stages-> Additional stages that can be installed in DataStage
to perform specialized tasks that the built-in stages do not support.
Server jobs and parallel jobs can make use of these.
Job Sequences Stages-> Special built-in stages which allow you to
define sequences of activities to run. Only Job Sequences have
these.
DataStage has built-in National Language Support (NLS). With NLS installed, DataStage
can do the following:
Process data in a wide range of languages
Accept data in any character set into most DataStage fields
Use local formats for dates, times, and money (server jobs)
Sort data according to local rules
Convert data between different encodings of the same language
(for example, for Japanese it can convert JIS to EUC)

A job consists of stages linked together which describe the flow of data from a data source to a data target (for example, a final data warehouse).
The different types of job have different stage types. The stages that are available in the DataStage Designer depend on the type of job that is
currently open in the Designer.
A passive stage handles access to databases for the extraction or writing of data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data,
and converting data from one data type to another.

Database
ODBC. -> Extracts data from or loads data into databases that support the industry standard Open Database
Connectivity API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniVerse. -> Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate
stage for aggregating data. This is a passive stage.
UniData. -> Extracts data from or loads data into UniData databases. This is a passive stage.
Oracle 7 Load. -> Bulk loads an Oracle 7 database. Previously known as ORABULK.
Sybase BCP Load. -> Bulk loads a Sybase 6 database. Previously known as BCPLoad.
File
Hashed File. -> Extracts data from or loads data into databases
that contain hashed files. Also acts as an
intermediate stage for quick lookups. This is a passive stage.
Sequential File. -> Extracts data from, or loads data into,
operating system text files. This is a passive stage.
Processing
Aggregator. -> Classifies incoming data into groups,
computes totals and other summary functions for each
group, and passes them to another stage in the job. This
is an active stage.
BASIC Transformer. -> Receives incoming data, transforms
it in a variety of ways, and outputs it to another
stage in the job. This is an active stage.
Folder. -> Folder stages are used to read or write data as
files in a directory located on the DataStage server.
Inter-process. ->Provides a communication channel
between DataStage processes running simultaneously in
the same job. This is a passive stage.
Link Partitioner. -> Allows you to partition a data set into
up to 64 partitions. Enables server jobs to run in parallel
on SMP systems. This is an active stage.

Link Collector. -> Collects partitioned data from up to 64


partitions. Enables server jobs to run in parallel on SMP
systems. This is an active stage.
RealTime
RTI Source. -> Entry point for a Job exposed as an RTI
service. The Table Definition specified on the output link
dictates the input arguments of the generated RTI
service.
RTI Target. -> Exit point for a Job exposed as an RTI
service. The Table Definition on the input link dictates
the output arguments of the generated RTI service.
Containers
Server Shared Container. -> Represents a group of stages
and links. The group is replaced by a single Shared
Container stage in the Diagram window.
Local Container. -> Represents a group of stages and links.
The group is replaced by a single Container stage in the
Diagram window
Container Input and Output. -> Represent the interface
that links a container stage to the rest of the job design.
Databases
DB2/UDB Enterprise. -> Allows you to read and write a
DB2 database.
Informix Enterprise. -> Allows you to read and write an
Informix XPS database.
Oracle Enterprise. -> Allows you to read and write an
Oracle database.
Teradata Enterprise. -> Allows you to read and write a
Teradata database.
Development/Debug Stages
Row Generator. -> Generates a dummy data set.
Column Generator. -> Adds extra columns to a data set.

Head. -> Copies the specified number of records from
the beginning of a data partition.
Peek. -> Prints column values to the screen as records are
copied from its input data set to one or more output
data sets.
Sample. -> Samples a data set.
Tail. -> Copies the specified number of records from the
end of a data partition.
Write range map. -> Enables you to carry out range map
partitioning on a data set.

File Stages
Complex Flat File. -> Allows you to read or write complex flat files on a mainframe machine. This is intended for use on USS systems.
Data set.-> Stores a set of data.
External source. -> Allows a parallel job to read an external data source.
External target. -> Allows a parallel job to write to an external data target.
File set. -> A set of files used to store data.
Lookup file set. -> Provides storage for a lookup table.
SAS data set. -> Provides storage for SAS data sets.
Sequential file. -> Extracts data from, or writes data to, a text file.

Processing Stages
Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.
Aggregator. -> Classifies incoming data into groups,
computes totals and other summary functions for each
group, and passes them to another stage in the job.
Change apply. -> Applies a set of captured changes to a data set.
Change Capture. -> Compares two data sets and records the differences between them.
Compare. -> Performs a column by column compare of two pre-sorted data sets.
Compress. -> Compresses a data set.
Copy. -> Copies a data set.
Decode. -> Uses an operating system command to decode a previously encoded data set.
Difference. -> Compares two data sets and works out the difference between them.
Encode. -> Encodes a data set using an operating system command.
Expand. -> Expands a previously compressed data set.
External Filter. -> Uses an external program to filter a data set.
Filter. -> Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.

Funnel. -> Copies multiple data sets to a single data set.


Generic. -> Allows Orchestrate experts to specify their own custom commands.
Lookup. -> Performs table lookups.
Merge. -> Combines data sets.
Modify. -> Alters the record schema of its input data set.
Remove duplicates. -> Removes duplicate entries from a data set.
SAS (Statistical Analysis System). -> Allows you to run SAS applications from within
the DataStage job.
Sort. -> Sorts input columns.
Switch. -> Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
Surrogate Key. -> Generates one or more surrogate key columns and adds them to an existing data set.
Real Time
RTI Source. -> Entry point for a Job exposed as an RTI
service. The Table Definition specified on the output link
dictates the input arguments of the generated RTI
service.
RTI Target. -> Exit point for a Job exposed as an RTI
service. The Table Definition on the input link dictates
the output arguments of the generated RTI service.
Restructure
Column export. -> Exports a column of another type to a string or binary column.
Column import. -> Imports a column from a string or binary column.
Combine records. -> Combines several columns associated by a key field to build a vector.
Make subrecord. -> Combines a number of vectors to form a subrecord.
Make vector. -> Combines a number of fields to form a vector.
Promote subrecord. -> Promotes the members of a subrecord to top-level fields.
Split subrecord. -> Separates a number of subrecords into top-level fields.
Split vector. -> Separates a number of vector members into separate columns.
Other Stages
Parallel Shared Container. -> Represents a group of stages
and links. The group is replaced by a single Parallel
Shared Container stage in the Diagram window. Parallel
Shared Container stages are handled differently from other
stage types; they do not appear on the palette.

Local Container. -> Represents a group of stages and links.
The group is replaced by a single Container stage in the
Diagram window.
Container Input and Output. -> Represent the interface
that links a container stage to the rest of the job design.
Links join the various stages in a job together and are used to specify how
data flows when the job is run.
Linking Server Stages ->
Stream. A link representing the flow of data. This is the principal
type of link, and is used by both active and passive stages.
Reference. A link representing a table lookup. Reference links are
only used by active stages. They are used to provide information
that might affect the way data is changed, but do not supply the
data to be changed.
Linking Parallel Stages ->
Stream. -> A link representing the flow of data. This is the principal
type of link, and is used by all stage types.
Reference.-> A link representing a table lookup. Reference links can
only be input to Lookup stages, they can only be output from
certain types of stage.
Reject. -> Some parallel job stages allow you to output records that
have been rejected for some reason onto an output link.
Parallel processing is the ability to carry out multiple operations or tasks simultaneously.
Pipeline Parallelism
-> If we run a job on a system with at least three processors, the stage reading would
start on one processor and start filling a pipeline with the data it had read.
-> The transformation stage would start running on the second processor as soon as there
was data in the pipeline, process it, and start filling another pipeline.
-> The target stage would start running on the third processor as soon as there was data in
its pipeline, so all three stages operate simultaneously.
Partitioning Parallelism
-> Using partitioning parallelism, the same job would effectively be run simultaneously
by several processors.

BULK COPY PROGRAM: Microsoft SQL Server and Sybase have a utility called BCP (Bulk Copy
Program). This command line utility copies SQL Server data to or from an
operating system file in a user-specified format. BCP uses the bulk copy
API in the SQL Server client libraries.
By using BCP, you can load large volumes of data into a table without
recording each insert in a log file. You can run BCP manually from a
command line using command line options (switches). A format (.fmt) file
is created which is used to load the data into the database.
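For illustration only, a minimal BCP invocation might look like the line below; the server, database, table, and file names are placeholders, not values from this document:

bcp SalesDB.dbo.Customers in customers.dat -f customers.fmt -S MYSERVER -U loaduser -P secret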
The Orabulk stage is a plug-in stage supplied by Ascential. The Orabulk
plug-in is installed automatically when you install DataStage.
An Orabulk stage generates control and data files for bulk loading into a
single table on an Oracle target database. The files are suitable for loading
into the target database using the Oracle command sqlldr.
One input link provides a sequence of rows to load into an Oracle table.
The meta data for each input column determines how it is loaded. One
optional output link provides a copy of all input rows to allow easy
combination of this stage with other stages.
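As a sketch (the control, data, and log file names are hypothetical), the files generated by the Orabulk stage could then be loaded with SQL*Loader like this:

sqlldr userid=scott/tiger control=orders.ctl data=orders.dat log=orders.log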
Lookup and join perform equivalent operations: combining two or more
input datasets based on one or more specified keys.
Lookup requires all but one (the first or primary) input to fit into physical
memory. Join requires all inputs to be sorted.
When one unsorted input is very large or sorting isn't feasible, lookup is
the preferred solution. When all inputs are of manageable size or are pre-sorted,
join is the preferred solution.
These are the temporary variables created in the Transformer stage for calculations.
Routines are the functions which we develop in BASIC code for required (complex) tasks
that DataStage does not fully support.
These parameters are used to provide administrative access and to change run-time values
of the job.
Choose Edit > Job Parameters; in that Parameters tab we can
define the name, prompt, type, and value.

Stage Variable - An intermediate processing variable that retains its value during processing and
does not pass the value into the target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constant - Conditions that are either true or false that specify the flow of data within a link.
A fact table consists of measurements of the business and the foreign keys of the
dimension tables, as per business rules. Because it holds only these keys and atomic
measures, with the descriptive attributes kept in the dimension tables, it stays in normal form.
An entity represents a chunk of information. In relational databases, an entity often
maps to a table.
An attribute is a component of an entity and helps define the uniqueness of the entity. In
relational databases, an attribute maps to a column.
The entities are linked together using relationships.

JOB SEQUENCE

Job Sequence?

Activity Stages?

Triggers?

Job Sequence Properties?

Job Report

How do you generate a Sequence number in Datastage?

JOB SEQUENCE
DataStage provides a graphical Job Sequencer which allows you to specify
a sequence of server jobs or parallel jobs to run. The sequence can also
contain control information; for example, you can specify different courses
of action to take depending on whether a job in the sequence succeeds or
fails. Once you have defined a job sequence, it can be scheduled and run
using the DataStage Director. It appears in the DataStage Repository and
in the DataStage Director client as a job.
Job. Specifies a DataStage server or parallel job.
Routine. Specifies a routine. This can be any routine in
the DataStage Repository (but not transforms).
ExecCommand. Specifies an operating system command
to execute.
Email Notification. Specifies that an email notification
should be sent at this point of the sequence (uses SMTP).
Wait-for-file. Waits for a specified file to appear or disappear.
Exception Handler. There can only be one of these in a
job sequence. It is executed if a job in the sequence fails to
run (other exceptions are handled by triggers) or if the
job aborts and the Automatically handle job runs that
fail option is set for that job.
Nested Conditions. Allows you to further branch the
execution of a sequence depending on a condition.
Sequencer. Allows you to synchronize the control flow
of multiple activities in a job sequence.
Terminator. Allows you to specify that, if certain situations
occur, the jobs a sequence is running shut down
cleanly.
Start Loop and End Loop. Together these two stages
allow you to implement a ForNext or ForEach loop
within your sequence.
User Variable. Allows you to define variables within a
sequence. These variables can then be used later on in
the sequence, for example to set job parameters.

The control flow in the sequence is dictated by how you interconnect
activity icons with triggers.
There are three types of trigger:
Conditional. A conditional trigger fires the target activity if the
source activity fulfills the specified condition. The condition is
defined by an expression, and can be one of the following types:
OK. Activity succeeds.
Failed. Activity fails.
Warnings. Activity produced warnings.
ReturnValue. A routine or command has returned a value.
Custom. Allows you to define a custom expression.
User status. Allows you to define a custom status message to
write to the log.
Unconditional. An unconditional trigger fires the target activity
once the source activity completes, regardless of what other triggers
are fired from the same activity.
Otherwise. An otherwise trigger is used as a default where a
source activity has multiple output triggers, but none of the conditional
ones have fired.
Job sequence properties are grouped on these tabs: General, Parameters, Job Control, Dependencies, NLS.
The job reporting facility allows you to generate an HTML report of a
server, parallel, or mainframe job or shared containers. You can view this
report in a standard Internet browser (such as Microsoft Internet Explorer)
and print it from the browser.
The report contains an image of the job design followed by information
about the job or container and its stages. Hotlinks facilitate navigation
through the report. The report is not dynamic; if you change the job design
you will need to regenerate the report.

Using the routines
KeyMgtGetNextVal and
KeyMgtGetNextValConn.
It can also be done by an Oracle sequence.

Scenarios

Suppose we have 3 jobs in a sequencer; while running,
if job1 fails then we still have to run job2 and job3.
How can we run them?

How do you remove duplicates using the Transformer
stage in DataStage?

How will you call shell scripts in sequencers in
DataStage?

What are the Environmental variables in Datastage?

How to extract job parameters from a file?

How to get the unique records on multiple columns by
using the sequential file stage only?
If a column contains data like
abc,aaa,xyz,pwe,xok,abc,xyz,abc,pwe,abc,pwe,xok,xyz,xxx,abc,roy,pwe,aaa,xxx,xyz,roy,xok....
how to send the unique data to one source and the remaining data
to another source?

how do u reduce warnings?


Is there any possibility to generate an alphanumeric
surrogate key?

How to lock/unlock the jobs as DataStage admin?


How to enter a log entry in an auditing table whenever a job
gets finished?

What is an Audit table? Have you used an audit table in your
project?
Can we use Round Robin for aggregator? Is there any
benefit underlying?
How many number of reject links merge stage can
have?
I have 3 jobs A, B and C, which are dependent on each
other. I want to run the A & C jobs daily and the B job only
on Sunday. How can we do it?
How to generate a surrogate key without using the
surrogate key stage?
What is the push and pull technique? I want to import two sequential
files to my desktop using the push technique; what will
I do?

what is .dsx files


How to capture rejected data by using the Join stage, not
the Lookup stage? Please let me know.

What is APT_DUMP_SCORE?
Country and state: 2 tables are there. Table 1 has cid, cname;
table 2 has sid, sname, cid. I want to display, based on cid,
which countries have more than 25 states.

What is the difference between the 7.1, 7.5.2 and 8.1 versions
of DataStage?

what is normalization and denormalization?

What is the difference between a junk dimension and a conformed
dimension?
30 jobs are running in Unix. I want to find out my
job. How do I do this? Give me the command.
What are the different types of errors in DataStage?

How do u convert the columns to rows in DataStage?

What are environment variables?


Where the DataStage stored his repository?

How can one source's columns or rows be loaded into
two different tables?

How do you register plug-ins?

The source sequential file stage has 10 records that
move to a Transformer stage; its output link has 2
records and the reject link has 5 records. But I want the
remaining 3 records; how do I capture them?

Scenarios
To run a job even if its previous job in the sequence is failed you need to go to the TRIGGER tab of
that particular job activity in the sequence itself.
There you will find three fields:
Name: This is the name of the next link (the link going to the next job; e.g., for job activity 1 the link name will
be the link going to job activity 2).
Expression Type: This will allow you to trigger your next job activity based on the status you want.
For example, if in case job 1 fails and you want to run the job 2 and job 3 then go to trigger
properties of the job 1 and select expression type as "Failed - (Conditional)". This way you can run
your job 2 even if your job 1 is aborted. There are many other options available.
Expression: This is editable for some options. Like for expression type "Failed" you can not change
this field.
I think this will solve your problem.

In that case, double-click on the Transformer stage and go to Stage Properties (the first icon in the
header line). Go to Inputs -> Partitioning and select a partitioning technique (other than Auto).
Now enable Perform Sort, then enable Unique, and select the required column name(s). The output
will then contain only unique values, so the duplicates are removed.

Shell scripts can be called in sequences by using the "Execute Command" activity. In this activity
type the following command:
bash /path_to_your_script/scriptname.sh
The bash command is used to run the shell script.
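As a minimal sketch (the path and script name are assumptions), the called script should exit with a non-zero status on failure so that the Execute Command activity's trigger can branch on the return value:

#!/bin/bash
# hypothetical wrapper invoked from an Execute Command activity
set -e                              # abort on the first failing command
/data/scripts/load_step.sh "$1"     # pass through the first argument supplied by the sequence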

The environmental variables in DataStage are paths and settings which the system can use as
shortcuts while running a program, instead of repeating the same setup work. Most of the time,
environmental variables are defined when the software is installed.
For extracting job parameters, we could use the dsjob command on a Linux or Unix platform to
achieve the activity of extracting parameters for a job.
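For example (a rough sketch; the project, job, and file names are placeholders, and values must not contain spaces in this simple version), parameters kept in a name=value file could be passed to dsjob -run like this:

#!/bin/bash
# params.txt contains lines such as SRC_DIR=/data/in
PARAM_ARGS=""
while IFS= read -r line; do
  PARAM_ARGS="$PARAM_ARGS -param $line"   # build one -param option per line
done < params.txt
dsjob -run $PARAM_ARGS MyProject MyJob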

In the Sequential File stage there is an option called Filter. In this filter we can use whatever Unix
commands we want.
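For instance (assuming comma-delimited data and that the first two columns form the key, which is an assumption for illustration), a filter command that keeps only records unique on those columns could be:

sort -t',' -k1,2 -u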

By using the Sort stage: go to Properties -> set Sorting Keys (key = column name) and set the option
Allow Duplicates = False.
In order to reduce warnings you need to get a clear idea
about the particular warning. If you can fix it on the code or
design side, fix it; otherwise go to the Director, select the warning, right-click and add a rule to the
message handler, then click
OK. From the next run onward you shouldn't find that warning.

It is not possible to generate an alphanumeric surrogate key
in DataStage.
I think this answer might satisfy you:
1. Open the Administrator.
2. Go to the Projects tab.
3. Click on the Command button.
4. Run the LIST.READU command and press Execute (it gives you the status of all jobs;
note the PID (process ID) of the jobs which you want to unlock).
5. Now close that and come back to the command window.
6. Now give the command DS.TOOLS and execute it.
7. Read the options given there and type "4" (option).
8. Then give 6 or 7 depending on your requirement.
9. Now give the PID that you noted before.
10. Then answer "Yes".
11. Generally it won't work the first time; press 7 again and then
give the PID again, and it will work.
Please get back to me if any further clarification is required.

Some companies use shell scripts to load logs into an audit table, and some companies load logs into
the audit table using DataStage jobs. These jobs are ones we develop ourselves.

An audit table is essentially a log; every job should have an audit
table.
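One hedged approach (the table, script, and connection details below are invented for illustration): call a small script from an after-job subroutine (ExecSH) that inserts a row into the audit table:

#!/bin/bash
# hypothetical audit logger, called as: audit_log.sh <project> <job> <status>
PROJECT=$1; JOB=$2; STATUS=$3
echo "INSERT INTO job_audit (project, job_name, status, finished_at)
VALUES ('$PROJECT', '$JOB', '$STATUS', SYSDATE);" | sqlplus -s audit_user/audit_pwd@auditdb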

Yes we can use Round Robin in Aggregator. It is used for Partitioning and Collecting.

We can have n-1 reject links for the Merge stage.

First you have to schedule the A & C jobs Monday to Saturday in one sequence. Next, take the three jobs
according to their dependency in one more sequence and schedule that job only on Sunday.
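For illustration (the run times, project, and sequence names are assumptions), the two sequences could then be scheduled from cron like this:

# 06:00 Monday-Saturday for the A & C sequence, 06:00 Sunday for the full sequence
0 6 * * 1-6 dsjob -run MyProject SEQ_A_AND_C
0 6 * * 0 dsjob -run MyProject SEQ_A_B_C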
By using the Transformer we can do it. To generate a sequence number
there is a formula using the system variables:
@PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS
For example, with 2 partitions this yields 0, 2, 4, ... on partition 0 and 1, 3, 5, ... on partition 1,
so the values are unique across partitions.

Push means the source team sends the data, and pull means
the developer extracts the data from the source.
A .dsx file is nothing but a DataStage project backup (export) file.
When we want to load the project on another system or server we take the file and load it on the
other system/server.

We cannot capture the reject data by using the Join stage.
For that we can use a Transformer stage after the Join stage.
APT_DUMP_SCORE is a reporting environment variable used to show how the data is processed
and how the processes are combined.
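As a hedged illustration (in practice the variable is usually added as a project- or job-level environment variable in the Administrator rather than exported in a shell), enabling it for a command-line run could look like:

export APT_DUMP_SCORE=True
dsjob -run MyProject MyJob   # the score dump then appears in the job log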

Join these two tables on cid and get all the columns to the
output. Then, in an Aggregator stage, count rows with key
column cid. Then use a Filter or Transformer to get the records
with count > 25.

The main difference is that in 7.5 we can open a job only once on a system, but in 8.1 we can open a job
multiple times in read-only mode. Another difference is that 8.1 has the Slowly Changing
Dimension stage, and the common repository is also there in 8.1.
Normalization is controlled by eliminating redundant
data, whereas denormalization is controlled by adding redundant
data.
JUNK DIMENSION
A dimension which cannot be used to describe the facts is
known as a junk dimension (a junk dimension provides additional
information to the main dimension).
Ex: customer address.
Conformed Dimension
A dimension table which can be shared by multiple fact tables
is known as a conformed dimension.
Ex: Time dimension.

ps -ef|grep USER_ID|grep JOB_NAME

Using Pivot Stage .

Basically an environment variable is a predefined variable that we can use while
creating a DS job. We can set it either at project level or job level. Once we set a
specific variable, that variable will be available in the project/job.
DataStage stores its repository in the IBM UniVerse database.
For columns - we can directly map the single source's columns to two different
targets.
For rows - we have to put some constraint (condition) on each output link.

Using the DataStage Manager: Tools -> Register Plug-in -> set the specific path -> OK.

DataStage Important Interview Question And Answer

What is DatawareHouse? Concept of Dataware house?

What type of data available in Datawarehouse?

What is a Node? What is Node Configuration?

What is the use of Nodes

What is Apt_Conf_File?

What is Complex Job in Datastage.

What is Version Control in Datastage.

What are descriptor file and data file in Dataset.

What is Job Commit ( in Datastage).

What are the Iconv and Oconv functions?

How to Improve Performance of Datastage Jobs?

Difference between Server Jobs and Parallel Jobs

Difference between Datastage and Informatica.

What is a compiler? Compilation process in DataStage

What is Modelling Of Datastage?

What is DataMart, Importance and Advantages?

Data Warehouse vs. Data Mart

What are different types of error in datastage?

What are the client components in DataStage 7.5x2 version?

Difference Between 7.5x2 And 8.0.1?

What is IBM Infosphere? And History

What is Datastage Project Contains?

What is Difference Between Hash And Modulus Technique?

What are Features of Datastage?

ETL Project Phase?

What is RCP?

What are the Roles And Responsibilities of a Software Engineer?

Server Component of DataStage 7.5x2 version?

How to create Group ID in Sort Stage?

What is Fastly Changing Dimension?

Force Compilation ?
How many rows are sorted in the Sort stage by default in server jobs?
When do we have to go for a sequential file stage & for a
dataset in DataStage?

What is the difference between the Switch and Filter stages in DataStage?

specify data stage strength?

symmetric multiprocessing (SMP)

Briefly state the difference between a data warehouse & a data mart?

What are System variables?

What are Sequencers?

What's the difference between an operational data store (ODS)
and a data warehouse?

What is the difference between Hashfile and Sequential File?

What is OCI?
Which algorithm you used for your hashfile?


A data warehouse is a database which is used to store data from heterogeneous sources,
with characteristics like
a) Subject Oriented
b) Historical Information
c) Integrated
d) Non Volatile
e) Time Variant
The source will be an Online Transaction Processing (OLTP) system. It collects the data from Online
Transaction Processing (OLTP) systems, which maintain the data for 30 - 90 days and are time sensitive. If
we like to store the data for a long period, we need a permanent database. That is an Archival
Database (AD).
Data in the data warehouse comes from the client systems. The data that you are using to
manage your business is very important for doing the manipulations according to the client
requirements.
A node is a logical CPU in DataStage.
Each node in a configuration file is distinguished by a virtual name and defines a
number, speed, CPUs, memory availability, etc.
Node configuration is a technique of creating logical CPUs.

In a Grid environment a node is the place where the jobs are executed.
Nodes are like processors; if we have more nodes when running the job, the
performance will be good, as the job runs in parallel and is more efficient.

APT_CONFIG_FILE points to the configuration (.apt) file, in which we declare the
nodes and their disk and scratch storage space. The configuration files are installed under the top-level
directory (i.e. the apt_orchhome config files). The size of the computer system on which you run jobs is
described in these configuration files. You can find the configuration files through Manager -> Tools ->
Configurations, and "node" is the name of the processing node that an entry defines.
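For illustration, a minimal one-node configuration file might look like the sketch below; the fastname and the disk/scratch paths are assumptions, not values from this document:

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ds/data/disk1" {pools ""}
    resource scratchdisk "/ds/scratch/scratch1" {pools ""}
  }
}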
Complex jobs in DataStage are nothing but jobs having many Joins, Lookups, or Transformer
stages in one job. There is no limitation on using stages in a job; we can use any number
of stages in a single job. But you should reduce the stages wherever you can, by writing
the queries in one stage rather than using two stages. Then you will get good
performance.
If you are getting more stages in the job, you have another technique to get good
performance: you can split the job into two jobs.

Version Control is used to store the different versions of DataStage jobs, run the
different versions of the same job, and revert to a previous version of a job.
Descriptor and Data files are the dataset files.
Descriptor file contains the Schema details and address of the data.
And Data file contains the data in the native format.
In the DRS stage we have a transaction isolation level, set to Read Committed,
and we set the array size and transaction size to 10 and 2000, so that it will commit every 2000
records.

Iconv and Oconv functions are used to convert date (and other) formats.
Iconv() is used to convert a string to the internal storage format, for example Iconv("2023-12-31", "D-YMD[4,2,2]").
Oconv() is used to convert an expression to an output format, for example Oconv(InternalDate, "D/MDY[2,2,4]").

Performance of the job is really important to maintain. Some of the precautions to get good
performance of the jobs are as follows. Avoid relying on a single end-to-end flow for performance
or tuning testing; try to work in increments, and isolate and solve the jobs part by part.
For that
a) Avoid using the Transformer stage wherever you can. For example, if you are using a
Transformer stage only to change the column names or to drop columns, use a Copy
stage rather than a Transformer stage. It will give good performance to the job.
b)Take care to take correct partitioning technique, according to the Job and
requirement.
c) Use User defined queries for extracting the data from databases .
d) If the data is less, use SQL join statements rather than using a Lookup stage.
e) If you have more number of stages in the Job, divide the job into multiple jobs.

Server jobs work only if the server edition of DataStage has been installed on your system.
Server jobs do not support the parallelism and partitioning techniques. Server jobs
generate BASIC programs after job compilation.
Parallel jobs work if you have installed the Enterprise Edition. They work on
DataStage servers that are SMP (Symmetric Multi-Processing), MPP (Massively Parallel
Processing), etc. Parallel jobs generate OSH (Orchestrate Shell) programs after job
compilation. Different stages are available, like Data Sets, Lookup stages, etc.
Server jobs work in a sequential way while parallel jobs work in a parallel fashion (the Parallel
Extender works on the principle of pipeline and partition parallelism) for input/output processing.
The difference between DataStage and Informatica is that
DataStage has partitioning, parallelism, Lookup, Merge, etc.,
but Informatica doesn't have this concept of partitioning and parallelism, and its file lookup is really
horrible.
Compilation is the process of converting the GUI design into code, that is, a
machine-understandable form.
In this process it checks all the link requirements, the mandatory stage property values,
and whether there are any logical errors.
The compiler produces OSH code.

Modeling is a Logical and physical representation of Source system.


Modeling has two types of modeling tools.
They are
ERWIN AND ER-STUDIO
In Source System there will be a ER-Model and
in the Target system there will be a ER-Model and Dimensional Model
Dimension:- The table which is designed from the client's perspective. We can view the data in
many ways through the dimension tables.

And there are two types of Models.


They are
Forward Engineering (F.E)
Reverse Engineering (R.E)
F.E:- F.E is the process of starting the model from scratch, for example for the banking sector.
Ex: any bank which requires a data warehouse.
R.E:- R.E is the process of altering an existing model for another bank.

A data mart is a repository of data gathered from operational data and other sources
that is designed to serve a particular community of knowledge workers. In scope, the
data may derive from an enterprise-wide database or data warehouse or be more
specialized. The emphasis of a data mart is on meeting the specific demands of a
particular group of knowledge users in terms of analysis, content, presentation, and ease
of use. Users of a data mart can expect to have data presented in terms that are familiar.

There are many reasons to create a data mart; it has a lot of importance and advantages.
It is easy to access frequently needed data from the database when required by the client.
We can give a group of users access to view the data mart when it is required, and of course
performance will be good.
It is easy to maintain and to create the data mart, and it will be related to a specific business.
And it is lower cost to create a data mart rather than creating a data warehouse with a huge
amount of space.

A data warehouse tends to be a strategic but somewhat unfinished concept. The


design of a data warehouse tends to start from an analysis of what data already exists
and how it can be collected in such a way that the data can later be used. A data
warehouse is a central aggregation of data (which can be distributed physically);
A data mart tends to be tactical and aimed at meeting an immediate need. The
design of a data mart tends to start from an analysis of user needs. A data mart is a data
repository that may derive from a data warehouse or not and that emphasizes ease of
access and usability for a particular designed purpose.

You may get many errors in DataStage while compiling or running the jobs.
Some of the errors are as follows:
a) Source file not found, if you are trying to read a file which is not there with that name.
b) Sometimes you may get fatal errors.
c) Data type mismatches; this will occur when data type mismatches occur in the jobs.
d) Field size errors.
e) Meta data mismatch.
f) Data type size between source and target is different.
g) Column mismatch.
i) Process time out, if the server is busy; this error comes sometimes.

In the Datastage 7.5X2 version, there are 4 client components. They are


1) Datastage Designer
2) Datastage Director
3) Datastage Manager
4) Datastage Admin
In Datastage Designer, We
Create the Jobs
Compile the Jobs
Run the Jobs
In Director, We can
View the Jobs
View the Logs
Batch Jobs
Unlock Jobs
Scheduling Jobs
Monitor the JOBS
Message Handling

In Manager, we can view and edit the contents of the Repository and import or export DataStage components.

1) In Datastage 7.5X2, there are 4 client components. They are


a) Datastage Design
b) Datastage Director
c Datastage Manager
d) Datastage Admin
And in
2) Datastage 8.0.1 Version, there are 5 components. They are
a) Datastage Design
b) Datastage Director
c) Datastage Admin
d) Web Console
e) Information Analyzer
Here Datastage Manager will be integrated with the Datastage Design option.

2) The DataStage 7.5x2 version is OS dependent; that is, OS users are DataStage users,

while in 8.0.1

it is OS independent; that is, users can be created within DataStage, with only a one-time OS
dependency.

DataStage is a product owned by IBM.


DataStage is an ETL tool and it is platform independent.
ETL means Extraction, Transformation and Loading.
DataStage is a product introduced by a company called VMark, under the name
DataIntegrator, in the UK in the year 1997.
It was later acquired by other companies and finally reached IBM in 2006.
DataStage got parallel capabilities when it was integrated with Orchestrate,
and platform-independent capabilities when integrated with the MKS Toolkit.

DataStage is a comprehensive ETL tool, used to extract, transform and load


the data. DataStage work is organized into DataStage projects. We can log in to the
DataStage Designer in order to enter a DataStage project for DataStage jobs, designing of
the jobs, etc.
DataStage jobs are maintained according to the project standards.
Every project contains the DataStage jobs, built-in components, table definitions,
the repository entries and the components required for the project.

Hash and Modulus techniques are Key based partition techniques.


Hash and Modulus techniques are used for different purpose.
If Key column data type is textual then we use hash partition technique for the job.
If Key column data type is numeric, we use modulus partition technique.
If one key column numeric and another text then also we use hash partition technique.
if both the key columns are numeric data type then we use modulus partition technique.

1)Any to Any
That means DataStage can extract the data from any source and can load the data into
any target.
2) Platform Independent
The job developed on one platform can run on any other platform. That means if
we design a job for uniprocessor-level processing, it can be run on an SMP machine.
3 )Node Configuration
Node Configuration is a technique to create logical C.P.U
Node is a Logical C.P.U
4)Partition Parallelism
Partition parallelism is a technique of distributing the data across the nodes based on the
partition techniques. The partition techniques are
a) Key based Techniques are
1 ) Hash 2)Modulus 3) Range 4) DB2
b) Key less Techniques are
1 ) Same 2) Entire 3) Round Robin 4 ) Random
5) Pipeline Parallelism

The four phases are


1) Data Profiling
2) Data Quality
3) Data Transformation
4) Meta data management
Data Profiling:
Data profiling is performed in 5 steps. Data profiling analyzes whether the source data is
good or dirty.
And these 5 steps are
a) Column Analysis
b) Primary Key Analysis
c) Foreign Key Analysis
d) Cross domain Analysis
e) Base Line analysis
After completing the analysis, if the data is good, there is no problem. If the data is dirty, it
will be sent for cleansing. This is done in the second phase.
Data Quality:
Data Quality, after getting the dirty data, cleans the data in 5 different ways.
They are
a) Parsing
b) Correcting
c) Standardize

RCP is nothing but Runtime Column Propagation. When we run DataStage jobs, the
columns may change from one stage to another stage. At that point of time we would be
loading unnecessary columns into the stage, which is not required. If we want to
load only the required columns into the target, we can do this by enabling RCP. If
we enable RCP, we can send the required columns to the target.

Roles and Responsibilities of Software Engineer are


1) Preparing Questions
2) Logical Designs ( i.e Flow Chart )
3) Physical Designs ( i.e Coding )
4) Unit Testing
5) Performance Tuning.
6) Peer Review
7) Design Turnover Document or Detailed Design Document or Technical design
Document
8) Doing Backups
9) Job Sequencing ( It is for Senior Developer )
There are three Architecture Components in datastage 7.5x2
They are
Repository:- an environment where we create, design, compile and run jobs, etc.
Some components it contains are
JOBS, TABLE DEFINITIONS, SHARED CONTAINERS, ROUTINES, ETC.
Server (engine):- here it runs executable jobs that extract, transform, and
load data into a data warehouse.
DataStage Package Installer:- a user interface used to install packaged DataStage jobs and plug-ins.

Group ids are created in two different ways. We can create group id's by using
a) Key Change Column
b) Cluster Key change Column
Both of the options are used to create group ids.
When we select either option and set it to True, it will create the group ids group-wise.
Data will be divided into the groups based on the key column and it will give (1) for the
first row of every group and (0) for rest of the rows in all groups.
Key change column and Cluster Key change column used, based on the data we are
getting from the source.
If the data we are getting is not sorted , then we use key change column to create group
id's
If the data we are getting is sorted data, then we use Cluster Key change Column to
create Group Id's .
The entities in a dimension which change rapidly form a
rapidly (fast) changing dimension. The best example is
ATM machine transactions.
For parallel jobs there is also a force compile option. The compilation of
parallel jobs is by default optimized such that transformer stages only get
recompiled if they have changed since the last compilation. The force
compile option overrides this and causes all transformer stages in the job
to be compiled. To select this option:
Choose File > Force Compile.
10,000 rows.
When the memory requirement is more, go for a data set; a sequential
file doesn't support more than 2 GB.

Filter: 1) We can write multiple conditions on multiple
fields.
2) It supports one input link and n number of output links.
Switch: 1) Multiple conditions on a single field (column).
2) It supports one input link and 128 output links.

The major strengths of DataStage are:


Partitioning,
pipelining,
Node configuration,
handles Huge volume of data,
Platform independent.
symmetric multiprocessing (SMP) involves a multiprocessor computer hardware
architecture where two or more identical processors are connected to a single shared
main memory and are controlled by a single OS instance. Most common multiprocessor
systems today use an SMP architecture.
A data warehouse is made up of many data marts. A DWH contains many subject areas;
however, a data mart generally focuses on one subject area. E.g., if there is a DWH for a
bank, then there can be one data mart for accounts, one for loans, etc. These are high-level
definitions.
A data mart (DM) is the access layer of the data warehouse (DW) environment that
is used to get data out to the users. The DM is a subset of the DW, usually oriented to a
specific business line or team.
System variables comprise of a set of variables which are used to get system information
and they can be accessed from a transformer or a routine. They are read only and start
with an @.

A sequencer allows you to synchronize the control flow of multiple activities in a job
sequence. It can have multiple input triggers as well as multiple output triggers.

A data warehouse is a decision-support database for organizational needs. It is a subject-


oriented, non-volatile, integrated, time-variant collection of data.
An ODS (Operational Data Store) is an integrated collection of related information; it
contains a maximum of about 90 days of information.
The ODS, the operational data store, is part of the transactional layer. This database
keeps integrated data from different transactional databases and allows common operations across
the organisation, e.g. banking transactions.
In simple terms the ODS is dynamic data.
Hash file stores the data based on hash algorithm and on a key value.
A sequential file is just a file with no key column.
Hash file used as a reference for look up.
Sequential file cannot.
If you mean by Oracle Call Interface (OCI), it is a set of low-level APIs used to interact
with Oracle databases. It allows one to use operations like logon, execute, parse, etc.
using a C or C++ program.
It uses the GENERAL or SEQ.NUM algorithm.
