E.g., all primary records for this month will be populated to one partition.
Modulus: the partitioned data carries some meaning. For example, all customers related to one store will be populated to
one partition.
If the key field is numeric, use modulus partitioning; otherwise use hash partitioning, as per performance tuning. Modulus
is the same operation as modulus in maths, so it can be performed only on numeric data fields. Hash can be used for
any kind of data field; it assigns records with the same key value to the same partition.
Modulus                    Hash
1. For numerics            1. For numerics and characters
2. Datatype specific       2. Not datatype specific
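The difference between the two methods can be sketched in a few lines of Python. This is only an illustration, not DataStage's own implementation: the function names are made up, and zlib.crc32 stands in for whatever hash function the engine actually uses.

```python
import zlib

def modulus_partition(key: int, num_partitions: int) -> int:
    """Modulus partitioning: only meaningful for integer keys."""
    return key % num_partitions

def hash_partition(key, num_partitions: int) -> int:
    """Hash partitioning: works for any key that can be rendered as text."""
    # Assumption: crc32 is a stand-in hash, not the one DataStage uses.
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_partitions

# Numeric store ids: the same store always lands in the same partition.
store_ids = [101, 102, 103, 101, 104, 102]
mod_assignments = [modulus_partition(s, 4) for s in store_ids]

# Character keys cannot use modulus, but they can be hashed.
names = ["alice", "bob", "alice"]
hash_assignments = [hash_partition(n, 4) for n in names]
```

In both cases equal key values end up in the same partition; modulus simply skips the hashing step when the key is already a number.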
A hash file stores data based on a hash algorithm and a key value. A sequential file is just a file with no key column.
A hash file can be used as a reference for a lookup; a sequential file cannot. The difference between a hash file and a
sequential file is that searching for a record is very fast in a hash file: based on the hash key, we can get the address of
the record directly. In a sequential file the search runs in sequential mode only, so it has to go through the file record by
record. We can also remove duplicate records based on the hash key in a hash file, which we cannot do in a sequential file.
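As a rough analogy (not how DataStage stores hashed files internally), a Python dict behaves like a hash file and a plain list like a sequential file. The record layout and function names below are invented for the example.

```python
records = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Raj"},
    {"cust_id": 1, "name": "Ann B."},  # same key as the first record
]

def sequential_lookup(rows, cust_id):
    """Sequential file: scan record by record until the key matches."""
    for row in rows:
        if row["cust_id"] == cust_id:
            return row
    return None

def build_hashed_file(rows):
    """Hash file: the key leads straight to the record."""
    hashed = {}
    for row in rows:
        # In this sketch a later record with the same key replaces the
        # earlier one, which is what removes duplicates on the key.
        hashed[row["cust_id"]] = row
    return hashed

hashed = build_hashed_file(records)
found = hashed[2]  # direct access by key, no scan
```

The list still holds three records, while the dict holds two, one per key.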
What is the difference between sequential file and a dataset? When to use the copy stage?
The Sequential File stage stores a small amount of data in a file that can have any extension, whereas a data set is used
to store a huge amount of data and is opened only through a file with the .ds extension. The Copy stage copies a single
input data set to a number of output data sets. Each record of the input data set is copied to every output data set.
Records can be copied without modification, or you can drop columns or change the order of columns.
Sequential file: you can view the data in the Unix box or with the View Data button.
You can delete the file in Unix with a command like rm filename.
Data set: you can't see the data in the Data Set stage, and you can't see it in the Unix box either.
To delete the data, use $orchadmin < delete | del | rm > [-f | -x] descriptorfiles...
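The Copy stage behaviour described above (every record goes to every output, optionally with columns dropped or reordered) can be sketched as follows. The function copy_stage and its arguments are hypothetical names for this illustration only.

```python
def copy_stage(records, output_columns):
    """Copy every input record to every output.

    output_columns holds one list of column names per output link;
    listing fewer columns drops the rest, and the list order sets
    the column order on that output.
    """
    outputs = []
    for columns in output_columns:
        outputs.append([{c: rec[c] for c in columns} for rec in records])
    return outputs

rows = [
    {"id": 1, "name": "Ann", "city": "Pune"},
    {"id": 2, "name": "Raj", "city": "Delhi"},
]

# First output: unmodified copy. Second output: "city" dropped, columns reordered.
full, reordered = copy_stage(rows, [["id", "name", "city"], ["name", "id"]])
```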
What is the Difference Between DataStage 7.5 version and 8.1 Version?
1. In DS 7.5.2 we have Manager as a client. In 8.0.1 we don't have a Manager client; the Manager client is embedded in
the Designer client.
2. In 7.5.2 QualityStage has a separate designer. In 8.0.1 QualityStage is integrated in Designer.
3. In 7.5.2 we required operating system authentication. In 8.0.1 we require operating system authentication and
DataStage authentication.
4. In 7.5.2 we don't have range lookup. In 8.0.1 we have range lookup.
5. In 7.5.2 a single Join stage can't support multiple references. In 8.0.1 a single Join stage can support multiple
references.
6. In 7.5.2, when a developer has a particular job open and another developer wants to open the same job, that job
can't be opened. In 8.0 it is possible: when a developer has a job open and another developer wants to open
the same job, it can be opened as a read-only job.
7. In 8.0.1 a compare utility is available to compare 2 jobs, for example one in development and another in production. In
7.5.2 this is not possible.
8. In 8 we have the SCD stage, but in 7 we don't.
9. In 8.0.1 the quick find and advanced find features are available; in 7.5.2 they are not.
10. In 7.5.2, the first time a job is run the surrogate key is generated from the initial value to n. The next time the same job
is compiled and run, the surrogate key is again generated from the initial value to n. Automatic increment of the surrogate
key is not available in 7.5.2. In 8.0.1 the surrogate key is incremented automatically; a state file is used to store the
maximum value of the surrogate key.
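The state-file idea in point 10 can be illustrated with a small Python sketch. It assumes a plain text file holding the last key issued; the real DataStage state file has its own format, and the function and file names here are invented.

```python
import os
import tempfile

def next_surrogate_keys(state_path, count, initial=1):
    """Return `count` new keys, continuing from the maximum in the state file."""
    last = initial - 1
    if os.path.exists(state_path):
        with open(state_path) as f:
            last = int(f.read().strip())
    keys = list(range(last + 1, last + 1 + count))
    if keys:
        # Persist the new maximum so the next run continues from it.
        with open(state_path, "w") as f:
            f.write(str(keys[-1]))
    return keys

# Hypothetical state file location for the demonstration.
state_file = os.path.join(tempfile.mkdtemp(), "customer_sk.state")
first_run = next_surrogate_keys(state_file, 3)   # starts at the initial value
second_run = next_surrogate_keys(state_file, 3)  # continues after the stored maximum
```

Without the state file, the second run would start again from the initial value, which is the 7.5.2 behaviour described in the list.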
PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism
means that as soon as data is available between stages (in pipes or links), it can be exchanged between them without
waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned into
small sets and processed on different nodes (logical processors). For example, if there are 100 records and 4
logical nodes, each node processes 25 records. This greatly increases the speed at which loading takes place.
Imagine situations where billions of records have to be loaded daily. This is where DataStage PX
comes as a boon for the ETL process and where it stands out among ETL tools.
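Both ideas can be mimicked in Python under some loud assumptions: generators stand in for pipelined stages, a thread pool stands in for logical nodes, and round robin is just one possible way to split the records. None of the names below come from DataStage.

```python
from concurrent.futures import ThreadPoolExecutor

def round_robin_partition(records, num_nodes):
    """Partitioning parallelism: split the record set across nodes."""
    partitions = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        partitions[i % num_nodes].append(rec)
    return partitions

def read_stage(records):
    """Upstream stage: hands over each record as soon as it is read."""
    for rec in records:
        yield rec

def transform_stage(stream):
    """Downstream stage: processes records as they arrive on the link."""
    for rec in stream:
        yield rec * 2

def run_node(partition):
    # Pipeline parallelism within one node: the generators pass records
    # along one at a time instead of waiting for the whole set.
    return list(transform_stage(read_stage(partition)))

records = list(range(100))
partitions = round_robin_partition(records, 4)

# Threads are only a stand-in for logical nodes in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_node, partitions))
```

With 100 records and 4 nodes, each partition holds 25 records, matching the example in the text.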
What is the difference between stages and operators?
--- Stages are the generic user interface through which we read from and write to files and databases and troubleshoot and
develop jobs; they are also capable of processing data.
--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, a DataStage stage
generates an Orchestrate operator directly.
A job sequence is used to run a group of jobs based on some conditions. For final/incremental processing we keep all the
jobs in one separate sequence and run them together by giving some triggers.
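The idea of running jobs under trigger conditions can be sketched like this. It is a toy model, not the Job Sequence stage itself: the trigger names "always" and "on_success", the function run_sequence, and the three example jobs are all assumptions made for the illustration.

```python
def run_sequence(jobs):
    """Run jobs in order; each job is (name, callable, trigger).

    trigger is "always" or "on_success" (run only if the previous
    job that ran finished without an error).
    """
    previous_ok = True
    log = []
    for name, job, trigger in jobs:
        if trigger == "on_success" and not previous_ok:
            log.append((name, "skipped"))
            continue
        try:
            job()
            previous_ok = True
            log.append((name, "ok"))
        except Exception:
            previous_ok = False
            log.append((name, "failed"))
    return log

# Hypothetical jobs for the demonstration.
def extract():
    pass

def load():
    raise RuntimeError("target unavailable")

def report():
    pass

log = run_sequence([
    ("extract", extract, "always"),
    ("load", load, "on_success"),
    ("report", report, "on_success"),
])
```

Here the failed load job causes the report job to be skipped, because its trigger requires the previous job to succeed.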