Data Extraction ?
Data Aggregation ?
Data Transformation ?
DataStage ?
Client Component ?
Server Component ?
DataStage Jobs ?
DataStage NLS ?
Stages
Passive Stage ?
Active Stage ?
Processing
Development/Debug Stages
Row Generator. -> Generates a dummy data set.
Column Generator. -> Adds extra columns to a data set.
File Stages
Complex Flat File. -> Allows you to read or write complex flat files on a mainframe machine.
Data set. -> Stores a set of data.
External source. -> Allows a parallel job to read an external data source.
External target. -> Allows a parallel job to write to an external data source.
File set. -> A set of files used to store data.
Lookup file set. -> Provides storage for a lookup table.
SAS data set. -> Provides storage for SAS data sets.
Sequential file. -> Extracts data from, or writes data to, a text file.
Processing Stages
Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.
Restructure
Column export. -> Exports a column of another type to a string or binary column.
Column import. -> Imports a column from a string or binary column.
Combine records. -> Combines several columns associated by a key field.
Make subrecord. -> Combines a number of vectors to form a subrecord.
Make vector. -> Combines a number of fields to form a vector.
Promote subrecord. -> Promotes the members of a subrecord to top-level columns.
Split subrecord. -> Separates a number of subrecords into top-level columns.
Split vector. -> Separates a number of vector members into separate columns.
Other Stages
Links ?
Parallel Processing
Types of Parallelism
Plug in Stage?
DataStage Designer
A data warehouse is a central integrated database containing data from all
the operational sources and archive systems in an organization. It contains
a copy of transaction data specifically structured for query analysis.
This database can be accessed by all users, ensuring that each group in an organization
is accessing valuable, stable data.
Operational databases are usually accessed by many concurrent users. The
data in the database changes quickly and often. It is very difficult to obtain
an accurate picture of the contents of the database at any one time.
Because operational databases are task oriented, for example, stock inventory
systems, they are likely to contain dirty data. The high throughput
of data into operational databases makes it difficult to trap mistakes or
incomplete entries. However, you can cleanse data before loading it into a
data warehouse, ensuring that you store only good complete records.
Data extraction is the process used to obtain data from operational sources, archives, and
external data sources.
The summed (aggregated) total is stored in the data warehouse. Because
the number of records stored in the data warehouse is greatly reduced, it
is easier for the end user to browse and analyze the data.
Transformation is the process that converts data to a required definition and value.
Data is transformed using routines based on a transformation rule, for
example, product codes can be mapped to a common format using a transformation
rule that applies only to product codes.
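The product-code example above can be sketched in a few lines of Python. This is a hypothetical illustration, not DataStage code: the source codes and the common target format are invented here.

```python
# Hypothetical mapping rule: the codes and target format are invented here.
code_map = {"prd-001": "P001", "PROD_1": "P001", "2": "P002"}

def transform(code):
    """Apply the transformation rule; unknown codes pass through unchanged."""
    return code_map.get(code, code)

clean = [transform(c) for c in ["prd-001", "PROD_1", "2", "P003"]]
```

The rule applies only to product codes it knows about; anything else passes through untouched, which mirrors how a transformation rule is scoped to one column type.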
After data has been transformed it can be loaded into the data warehouse
in a recognized and required format.
Capitalizes on the potential value of the organization's information
Improves the quality and accessibility of data
Combines valuable archive data with the latest data in operational sources
Increases the amount of information available to users
DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.
DataStage Designer -> A design interface used to create DataStage applications (known as jobs).
DataStage Director -> A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.
DataStage Manager -> A user interface used to view and edit the contents of the Repository.
DataStage Administrator -> A user interface used to perform administration tasks such as setting up DataStage users,
creating and moving projects, and setting up purging criteria.
Repository -> A central store that contains all the information required to build a data mart or data warehouse.
DataStage Server -> Runs executable jobs that extract, transform, and load data into a data warehouse.
DataStage Package Installer -> A user interface used to install packaged DataStage jobs and plug-ins.
Basic types of DataStage jobs
Server Jobs ->These are compiled and run on the DataStage server.
A server job will connect to databases on other machines as necessary,
extract data, process it, then write the data to the target data
warehouse.
Parallel Jobs -> These are compiled and run on the DataStage server
in a similar way to server jobs, but support parallel processing on
SMP, MPP, and cluster systems.
MainFrame Jobs -> These are available only if you have Enterprise
MVS Edition installed. A mainframe job is compiled and run on
the mainframe. Data extracted by such jobs is then loaded into the
data warehouse.
Shared Containers -> These are reusable job elements. They typically
comprise a number of stages and links. Copies of shared containers
can be used in any number of server jobs or parallel jobs and edited
as required.
Job Sequences -> A job sequence allows you to specify a sequence of
DataStage jobs to be executed, and actions to take depending on
results.
Built in Stages -> Supplied with DataStage and used for extracting,
aggregating, transforming, or writing data. All types of job have
these stages.
Plug in Stages-> Additional stages that can be installed in DataStage
to perform specialized tasks that the built-in stages do not support.
Server jobs and parallel jobs can make use of these.
Job Sequence Stages -> Special built-in stages which allow you to
define sequences of activities to run. Only job sequences have
these.
DataStage has built-in National Language Support (NLS). With NLS installed, DataStage
can do the following:
Process data in a wide range of languages
Accept data in any character set into most DataStage fields
Use local formats for dates, times, and money (server jobs)
Sort data according to local rules
Convert data between different encodings of the same language
(for example, for Japanese it can convert JIS to EUC)
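The JIS-to-EUC conversion mentioned above can be sketched with Python's standard codecs. This is only an illustration of converting between two encodings of the same language, not the NLS implementation itself; the sample string is invented.

```python
# Convert Japanese text between two encodings of the same language,
# analogous to NLS converting JIS to EUC.
text = "日本語"                          # hypothetical sample Japanese string
jis_bytes = text.encode("iso2022_jp")    # a JIS-style encoding
euc_bytes = text.encode("euc_jp")        # the EUC encoding

# Conversion = decode from one encoding, re-encode in the other.
converted = jis_bytes.decode("iso2022_jp").encode("euc_jp")
```

The same decode/re-encode pattern works for any pair of encodings the runtime knows about.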
A job consists of stages linked together which describe the flow of data from a data source to a data target.
The different types of job have different stage types. The stages that are available in the DataStage Designer
depend on the type of job currently open in the Designer.
A passive stage handles access to databases for the extraction or writing of data.
Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data,
and converting data from one data type to another.
Database
ODBC. -> Extracts data from or loads data into databases that support the industry standard Open Database
Connectivity API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.
UniVerse. -> Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate
stage for aggregating data. This is a passive stage.
UniData. -> Extracts data from or loads data into UniData databases. This is a passive stage.
Oracle 7 Load. -> Bulk loads an Oracle 7 database. Previously known as ORABULK.
Sybase BCP Load. -> Bulk loads a Sybase 6 database. Previously known as BCPLoad.
File
Hashed File. -> Extracts data from or loads data into databases
that contain hashed files. Also acts as an
intermediate stage for quick lookups. This is a passive stage.
Sequential File. -> Extracts data from, or loads data into,
operating system text files. This is a passive stage.
Processing
Aggregator. -> Classifies incoming data into groups,
computes totals and other summary functions for each
group, and passes them to another stage in the job. This
is an active stage.
BASIC Transformer. -> Receives incoming data, transforms
it in a variety of ways, and outputs it to another
stage in the job. This is an active stage.
Folder. -> Folder stages are used to read or write data as
files in a directory located on the DataStage server.
Inter-process. -> Provides a communication channel
between DataStage processes running simultaneously in
the same job. This is a passive stage.
Link Partitioner. -> Allows you to partition a data set into
up to 64 partitions. Enables server jobs to run in parallel
on SMP systems. This is an active stage.
File Stages
Complex Flat File. -> Allows you to read or write complex flat files on a mainframe machine.
Data set.-> Stores a set of data.
External source. -> Allows a parallel job to read anexternal data source.
External target. -> Allows a parallel job to write to anexternal data source.
File set. -> A set of files used to store data.
Lookup file set. -> Provides storage for a lookup table.
SAS data set. -> Provides storage for SAS data sets.
Sequential file. -> Extracts data from, or writes data to, a text file.
Processing Stages
Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.
Aggregator. -> Classifies incoming data into groups,
computes totals and other summary functions for each
group, and passes them to another stage in the job.
Change apply. -> Applies a set of captured changes to a data set.
Change Capture. -> Compares two data sets and records the differences between them.
Compare. -> Performs a column by column compare of two pre-sorted data sets.
Compress. -> Compresses a data set.
Copy. -> Copies a data set.
Decode. -> Uses an operating system command to decode a previously encoded data set.
Difference. -> Compares two data sets and works out the difference between them.
Encode. -> Encodes a data set using an operating system command.
Expand. -> Expands a previously compressed data set.
External Filter. -> Uses an external program to filter a data set.
Filter. -> Transfers, unmodified, the records of the input data set which satisfy requirements that you specify.
BULK COPY PROGRAM: Microsoft SQL Server and Sybase have a utility called BCP (Bulk Copy Program).
This command line utility copies SQL Server data to or from an
operating system file in a user-specified format. BCP uses the bulk copy
API in the SQL Server client libraries.
By using BCP, you can load large volumes of data into a table without
recording each insert in a log file. You can run BCP manually from a
command line using command line options (switches). A format (.fmt) file
is created which is used to load the data into the database.
The Orabulk stage is a plug-in stage supplied by Ascential. The Orabulk
plug-in is installed automatically when you install DataStage.
An Orabulk stage generates control and data files for bulk loading into a
single table on an Oracle target database. The files are suitable for loading
into the target database using the Oracle command sqlldr.
One input link provides a sequence of rows to load into an Oracle table.
The meta data for each input column determines how it is loaded. One
optional output link provides a copy of all input rows to allow easy
combination of this stage with other stages.
Lookup and join perform equivalent operations: combining two or more
input datasets based on one or more specified keys.
Lookup requires all but one (the first or primary) input to fit into physical
memory. Join requires all inputs to be sorted.
When one unsorted input is very large or sorting isn't feasible, lookup is
the preferred solution. When all inputs are of manageable size or are pre-sorted,
join is the preferred solution.
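The contrast above can be sketched in plain Python. This is a hypothetical illustration with invented sample data, not the engine's implementation: the lookup builds a hash table from the small reference input, while the sort-merge join needs both inputs sorted first.

```python
# Hypothetical sample data: (cid, sname) stream and a small (cid, cname) reference.
states = [(1, "NSW"), (2, "TX"), (1, "VIC")]   # large streaming input
countries = [(1, "Australia"), (2, "USA")]     # small reference input

# Lookup: hold the small input in memory as a hash table and stream
# the large input through it -- no sorting needed on either side.
ref = dict(countries)
lookup_result = [(sname, ref[cid]) for cid, sname in states]

# Join (sort-merge): both inputs must be sorted on the key first.
def merge_join(left, right):
    """Two-cursor merge of inputs pre-sorted on their first field."""
    out, j = [], 0
    for key, lval in left:
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((lval, right[j][1]))
    return out

join_result = merge_join(sorted(states), sorted(countries))
```

Both strategies produce the same combined rows; they differ only in the resources they demand (memory for lookup, sorted inputs for join).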
Stage variables are temporary variables created in the Transformer stage for calculations.
Routines are functions which we develop in BASIC code for required tasks that
DataStage does not fully support (complex tasks).
These parameters are used to provide administrative access and to change run time values
of the job.
Edit > Job Parameters. In the Parameters tab we can
define the name, prompt, type, and value.
Stage Variable - An intermediate processing variable that retains its value during a read and
does not pass the value into the target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constraint - A condition that is either true or false that specifies the flow of data within a link.
A fact table consists of measurements of business requirements and foreign keys of
dimension tables as per business rules.
An entity represents a chunk of information. In relational databases, an entity often
maps to a table.
An attribute is a component of an entity and helps define the uniqueness of the entity. In
relational databases, an attribute maps to a column.
The entities are linked together using relationships.
JOB SEQUENCE
Job Sequence?
Activity Stages?
Triggers?
Job Report
DataStage provides a graphical Job Sequencer which allows you to specify
a sequence of server jobs or parallel jobs to run. The sequence can also
contain control information; for example, you can specify different courses
of action to take depending on whether a job in the sequence succeeds or
fails. Once you have defined a job sequence, it can be scheduled and run
using the DataStage Director. It appears in the DataStage Repository and
in the DataStage Director client as a job.
Job. Specifies a DataStage server or parallel job.
Routine. Specifies a routine. This can be any routine in
the DataStage Repository (but not transforms).
ExecCommand. Specifies an operating system command
to execute.
Email Notification. Specifies that an email notification
should be sent at this point of the sequence (uses SMTP).
Wait-for-file. Waits for a specified file to appear or disappear.
Exception Handler. There can only be one of these in a
job sequence. It is executed if a job in the sequence fails to
run (other exceptions are handled by triggers) or if the
job aborts and the Automatically handle job runs that
fail option is set for that job.
Nested Conditions. Allows you to further branch the
execution of a sequence depending on a condition.
Sequencer. Allows you to synchronize the control flow
of multiple activities in a job sequence.
Terminator. Allows you to specify that, if certain situations
occur, the jobs a sequence is running shut down
cleanly.
Start Loop and End Loop. Together these two stages
allow you to implement a For...Next or For...Each loop
within your sequence.
User Variable. Allows you to define variables within a
sequence. These variables can then be used later on in
the sequence, for example to set job parameters.
Scenarios
What is APT_DUMP_SCORE?
Country and state: there are 2 tables. Table 1 has
cid, cname.
Table 2 has sid, sname, cid. Based on cid, I want to
display which countries have more than 25 states.
Scenarios
To run a job even if its previous job in the sequence is failed you need to go to the TRIGGER tab of
that particular job activity in the sequence itself.
There you will find three fields:
Name: This is the name of the next link (the link going to the next job, e.g. for job activity 1 the link name will
be the link going to job activity 2).
Expression Type: This will allow you to trigger your next job activity based on the status you want.
For example, if in case job 1 fails and you want to run the job 2 and job 3 then go to trigger
properties of the job 1 and select expression type as "Failed - (Conditional)". This way you can run
your job 2 even if your job 1 is aborted. There are many other options available.
Expression: This is editable for some options. Like for expression type "Failed" you can not change
this field.
I think this will solve your problem.
In that case, double click on the Transformer stage ---> go to Stage Properties (the first icon in the
header line) ---> go to Inputs ---> go to Partitioning ---> select one partitioning technique (other than
Auto) ---> enable Perform Sort ---> then enable Unique and select the required column name. Now the
output will contain only unique values, so the duplicates are removed.
Shell scripts can be called in sequences by using the "Execute Command" activity. In this activity
type the following command:
bash /path of your script/scriptname.sh
bash command is used to run the shell script.
The environment variables in DataStage are paths and settings that the system can use as
shortcuts while running programs, instead of repeating the same setup work. In most cases,
environment variables are defined when the software is installed.
Could we use the dsjob command on a Linux or UNIX platform to achieve the activity of extracting
parameters from a job?
In the Sequential File stage there is an option called Filter. In this filter we can use whatever
UNIX commands we want.
By using the Sort stage: go to Properties -> set Sorting Keys key = column name and set the option Allow
Duplicates = false.
In order to reduce the warnings you need to get a clear idea
about the particular warning. If you can see the cause on the code or
design side, fix it; otherwise go to the Director --> select the warning, right click and add a rule to
the message handler, then click
OK. From the next run onward you shouldn't see those warnings.
Some companies use shell scripts to load logs into an audit table; other companies load logs into
the audit table using DataStage jobs, which we develop ourselves.
An audit table is essentially a log file; every job should have an audit
table.
Yes we can use Round Robin in Aggregator. It is used for Partitioning and Collecting.
First you have to schedule the A & C jobs Monday to Saturday in one sequence. Next take the three jobs
according to dependency in one more sequence and schedule that job only on Sunday.
By using the Transformer we can do it. To generate a sequence number
there is a formula using the system variables,
i.e. @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS.
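The formula can be checked with a short Python sketch. This is an illustration, not engine code; it assumes partition numbers are 0-based and row numbers within a partition start at 1.

```python
# Sketch of the surrogate-key formula; partition numbers are 0-based and
# row numbers within each partition start at 1.
num_partitions = 4

def surrogate_key(partition_num, in_row_num):
    # @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS
    return partition_num + (in_row_num - 1) * num_partitions

# 5 rows on each of 4 partitions -> 20 distinct keys, with no
# coordination between partitions.
keys = [surrogate_key(p, r) for p in range(num_partitions) for r in range(1, 6)]
```

Each partition generates an interleaved, non-overlapping number line, which is why the keys are unique across the whole job.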
push means the source team sends the data and pull means
the developer extracts the data from source.
A .dsx file is nothing but the DataStage project backup file.
When we want to load the project on another system or server, we take the file and load it on the
other system/server.
Join these two tables on cid and get all the columns to the
output. Then in the Aggregator stage, count rows with key
column cid. Then use a Filter or Transformer to get records
with count > 25.
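The join, aggregate, and filter steps above can be sketched in Python. The sample rows are invented (30 states for one country so it crosses the threshold); this only illustrates the data flow, not any DataStage stage.

```python
from collections import Counter

# Hypothetical sample rows; 30 states for cid 1 so it crosses the threshold.
country = {1: "India", 2: "USA"}                       # table 1: cid -> cname
states = [(i, "s%d" % i, 1) for i in range(30)] + [(100, "Texas", 2)]

counts = Counter(cid for _sid, _sname, cid in states)  # Aggregator: rows per cid
result = [country[cid] for cid, n in counts.items() if n > 25]  # Filter: > 25
```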
The main difference is that in 7.5 we can open a job only once at a time, but in 8.1 we can open one job
multiple times in read-only mode. Another difference is that 8.1 has the Slowly Changing
Dimension stage and the common Repository.
Normalization is controlled by eliminating redundant
data, whereas denormalization is controlled by deliberately keeping redundant
data.
JUNK DIMENSION
A dimension which cannot be used to describe the facts is
known as a junk dimension (a junk dimension provides additional
information to the main dimension).
Ex: customer address
Conformed Dimension
A dimension table which can be shared by multiple fact tables
is known as a conformed dimension.
Ex:- Time dimension
Using the DataStage Manager: Tools -> Register Plug-in -> set the specific path -> OK.
What is Apt_Conf_File?
What is RCP?
Force Compilation ?
How many rows are sorted in the Sort stage by default in server jobs?
When do we have to go for a Sequential File stage and when for a
Data Set in DataStage?
Briefly state the difference between a data warehouse & a data mart?
What is OCI?
Which algorithm you used for your hashfile?
In a Grid environment a node is the place where the jobs are executed.
Nodes are like processors; if we have more nodes when running the job, the
performance will be better, as the job runs in parallel and is more efficient.
APT_CONFIG_FILE points to the configuration (.apt) file, in which we can store the
nodes and disk storage space. The configuration file is installed under the top level directory (i.e.
apt_orchhome/config files). The size of the computer system on which you run jobs is
defined in the configuration files. You can find the configuration files in Manager --- Tools --- Configuration,
and node is the name of the processing node that an entry defines.
Complex jobs in DataStage are nothing but jobs having many joins, lookups, or Transformer
stages in one job. There is no limitation on using stages in a job; we can use any number
of stages in a single job. But you should reduce the number of stages wherever you can, by writing
the queries in one stage rather than using two stages; then you will get good
performance.
If you still have many stages in the job, you have another technique to get good
performance: you can split the job into two jobs.
Version Control is used to store the different versions of DataStage jobs. It runs the
different versions of the same job, and can also revert to a previous version of a job.
Descriptor and Data files are the dataset files.
Descriptor file contains the Schema details and address of the data.
And Data file contains the data in the native format.
In the DRS stage we have Transaction Isolation, set to Read Committed,
and we set the Array Size and Transaction Size to 10 and 2000 respectively, so that it commits every 2000
records.
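The commit-every-2000-rows behaviour can be sketched as batching. This is a hypothetical illustration of the idea only; the list append stands in for a database COMMIT, and the row data is invented.

```python
# Sketch: write rows in batches, committing every `transaction_size` rows,
# so a failure only rolls back the current uncommitted batch.
transaction_size = 2000
rows = list(range(5000))          # hypothetical input rows
committed_batches = []

batch = []
for row in rows:
    batch.append(row)
    if len(batch) == transaction_size:
        committed_batches.append(batch)   # stands in for COMMIT
        batch = []
if batch:
    committed_batches.append(batch)       # final partial commit
```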
Iconv and Oconv functions are used for data conversions (for example, dates).
Iconv() is used to convert a string to the internal storage format.
Oconv() is used to convert an expression to an output format.
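The internal/external round trip can be sketched in Python. This is an assumption-laden illustration, not the BASIC functions themselves: it assumes the UniVerse internal date format, where day 0 is 31 December 1967, and the function names `iconv_date`/`oconv_date` are invented here.

```python
from datetime import date, timedelta

# Assumption: UniVerse internal dates count days from 31 Dec 1967 (day 0).
EPOCH = date(1967, 12, 31)

def iconv_date(d):
    """Sketch of Iconv(..., "D"): external date -> internal day number."""
    return (d - EPOCH).days

def oconv_date(internal):
    """Sketch of Oconv(..., "D-YMD"): internal day number -> external text."""
    return (EPOCH + timedelta(days=internal)).strftime("%Y-%m-%d")
```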
Server jobs work only if the server edition of DataStage has been installed on your system.
Server jobs do not support parallelism and partitioning techniques. Server jobs
generate BASIC programs after job compilation.
Parallel jobs work if you have installed the Enterprise Edition. They work on
DataStage servers that are SMP (Symmetric Multi-Processing), MPP (Massively Parallel
Processing), etc. Parallel jobs generate OSH (Orchestrate Shell) programs after job
compilation, and have different stages such as Data Set, Lookup, etc.
Server jobs work in a sequential way while parallel jobs work in a parallel fashion (the Parallel
Extender works on the principle of pipeline and partition) for input/output processing.
The difference between DataStage and Informatica is that
DataStage has partitioning, parallelism, Lookup, Merge, etc.,
but Informatica doesn't have this concept of partitioning and parallelism, and file lookup is really
horrible.
Compilation is the process of converting the GUI design into machine code, that is,
machine understandable language.
In this process it checks all the link requirements and stage mandatory property values,
and whether there are any logical errors.
The compiler produces OSH code.
A data mart is a repository of data gathered from operational data and other sources
that is designed to serve a particular community of knowledge workers. In scope, the
data may derive from an enterprise-wide database or data warehouse or be more
specialized. The emphasis of a data mart is on meeting the specific demands of a
particular group of knowledge users in terms of analysis, content, presentation, and easeof-use. Users of a data mart can expect to have data presented in terms that are familiar.
There are many reasons to create a data mart; it has a lot of importance and
advantages.
It is easy to access frequently needed data from the database when required by the client.
We can give a group of users access to view the data mart when it is required, and of course
performance will be good.
It is easy to create and maintain a data mart, and it will be related to a specific business.
And it is low cost to create a data mart rather than creating a data warehouse, which needs a huge
amount of space.
You may get many errors in datastage while compiling the jobs or running the jobs.
Some of the errors are as follows
a) Source file not found. This occurs if you are trying to read a file which is not there with that
name.
b)Some times you may get Fatal Errors.
c) Data type mismatches. This will occur when data type mismatches occurs in the jobs.
d) Field Size errors.
e) Meta data Mismatch
f) Data type size between source and target different
g) Column Mismatch
h) Process time out. If the server is busy, this error will come sometimes.
1) Any to Any
That means DataStage can extract the data from any source and can load the data into
any target.
2) Platform Independent
A job developed on one platform can run on any other platform. That means if
we designed a job on a uniprocessor system, it can be run on an SMP machine.
3) Node Configuration
Node configuration is a technique to create logical CPUs.
A node is a logical CPU.
4) Partition Parallelism
Partition parallelism is a technique of distributing the data across the nodes based on
partition techniques. The partition techniques are:
a) Key based techniques:
1) Hash 2) Modulus 3) Range 4) DB2
b) Keyless techniques:
1) Same 2) Entire 3) Round Robin 4) Random
5) Pipeline Parallelism
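The key-based versus keyless contrast in partition parallelism can be sketched in Python. This is an invented illustration of the distribution idea, not the engine's partitioner: Modulus stands in for the key-based family and Round Robin for the keyless family.

```python
# Hypothetical key values (e.g. customer ids) to spread over 3 nodes.
rows = [10, 11, 25, 3, 7, 14]
nodes = 3

# Key-based (Modulus): rows with the same key always land on the same node.
modulus = {n: [r for r in rows if r % nodes == n] for n in range(nodes)}

# Keyless (Round Robin): rows are dealt out evenly regardless of key.
round_robin = {n: rows[n::nodes] for n in range(nodes)}
```

Key-based techniques matter when a downstream stage (e.g. Aggregator) needs all rows for a key on one node; keyless techniques just balance the load.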
RCP is nothing but Runtime Column Propagation. When we run DataStage jobs, the
columns may change from one stage to another stage. At that point we would be
loading unnecessary columns into a stage, which is not required. If we want to
load only the required columns into the target, we can do this by enabling RCP; if
we enable RCP, we can send the required columns to the target.
Group ids are created in two different ways. We can create group ids by using
a) Key Change Column
b) Cluster Key Change Column
Both options are used to create group ids.
When we select either option and set it to true, it creates the group ids group wise.
Data will be divided into the groups based on the key column and it will give (1) for the
first row of every group and (0) for rest of the rows in all groups.
Key change column and Cluster Key change column used, based on the data we are
getting from the source.
If the data we are getting is not sorted, then we use Key Change Column to create group
ids.
If the data we are getting is sorted, then we use Cluster Key Change Column to
create group ids.
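The group-id behaviour described above can be sketched in a few lines. This is an invented illustration of the key-change indicator, not stage code: on sorted input it emits 1 for the first row of each key group and 0 for the rest.

```python
# Hypothetical sorted input; the key column is the first field of each row.
rows = [("A", 10), ("A", 20), ("B", 5), ("B", 7), ("B", 9), ("C", 1)]

# Key change indicator: 1 on the first row of each group, 0 otherwise.
key_change = []
prev = object()                 # sentinel: no previous key yet
for key, _value in rows:
    key_change.append(1 if key != prev else 0)
    prev = key
```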
The entities in a dimension which change rapidly are
called a rapidly (fast) changing dimension. The best example is
ATM machine transactions.
For parallel jobs there is also a force compile option. The compilation of
parallel jobs is by default optimized such that transformer stages only get
recompiled if they have changed since the last compilation. The force
compile option overrides this and causes all transformer stages in the job
to be compiled. To select this option:
Choose File > Force Compile
10,000
When the memory requirement is more, go for a Data Set; a sequential
file doesn't support more than 2 GB.
A sequencer allows you to synchronize the control flow of multiple activities in a job
sequence. It can have multiple input triggers as well as multiple output triggers.