
# In which two situations would you use the Web Services Client stage? DK
You need the Web service to act as either a data source or a data target during an
operation.
You do not need both input and output links in a single web service operation.

# You need to invoke a job from the command line that is multi-instance enabled.
What is the correct syntax to start a multi-instance job? DK
dsjob -run -mode NORMAL .
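
As a sketch (the project name, job name and invocation ID below are hypothetical), a
multi-instance job is started by appending an invocation ID to the job name:
dsjob -run -mode NORMAL dwh_proj LoadCustomers.INV01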

# A client must support multiple languages in selected text columns when
reading from a DB2 database. Which two actions will allow selected columns to
support such data?
Choose Unicode setting in the extended column attribute.
Choose NVar/NVarchar as data types.

# Which two system variables/techniques must be used in a parallel Transformer
derivation to generate a unique sequence of integers across partitions?
@PARTITIONNUM
@NUMPARTITIONS

# You are experiencing performance issues for a given job. You are assigned the
task of understanding what is happening at run time for the given job. What are
the first two steps you should take to understand the job performance issues?
Run job with $APT_TRACE_RUN set to true.
Review the objectives of the job.

# Your customer asks you to identify which stages in a job are consuming the
largest amount of CPU time. Which product feature would help identify these
stages?
$APT_PM_PLAYER_TIMING

# Unix command to stop the DataStage engine? DK
$DSHOME/bin/uv -admin -stop

# Unix command to start the DataStage engine? DK
$DSHOME/bin/uv -admin -start

# Unix command to check DataStage jobs running on the server? DK
ps -ef | grep phantom

# Unix command to check DataStage sessions running at the backend?
netstat -na | grep dsr
netstat -a | grep dsr
netstat -a | grep dsrpc

# How to unlock a DataStage job?
Clean Up Resources in Director
Clear Status File in Director
DS.TOOLS in Administrator
DS.TOOLS in Unix

# Command to check the Datastage Job Status ?
dsjob -status

# What are the parts of a configuration file?
Node
ServerName
Pools
FastName
ResourceDisk

# Where are DataStage temporary dataset files stored while running a
parallel job?
ResourceScratchDisk
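
As an illustrative sketch (the host name and directory paths are hypothetical), a
one-node configuration file ties these parts together; the resource scratchdisk entry
is where temporary dataset files are written during a run:
{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}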

# Which three statements describe a DataStage installation in a clustered
environment? DK
The conductor node will create the job score and consolidate messages to the
DataStage log.
For clustered implementations, appropriate node pools are useful to reduce data
shipping.
Compiled job and all shared components must be available across all servers.

# Which three defaults are set in DataStage Administrator?
Project level defaults for environment variables.
project level default for compile options
project level default for Runtime Column Propagation

# Which two environment variables should be set to "True" to allow you to see
operator process statistics at run-time in the job log?
$APT_PM_PLAYER_MEMORY
$APT_PM_PLAYER_TIMING

# Which three statements are true about National Language Support (NLS)?
NLS must be selected during installation to use it.
Within an NLS enabled DataStage environment, maps are used to convert external
data into UTF-16.
Reading or writing 7-bit ASCII data from a database does not require NLS support.

# Upon which two conditions does the number of data files created by a File Set
depend?
The number of processing nodes in the default node pool.
The number of disks in the export or default disk pool connected to each processing
node in the default node pool

# Which command line switch can be used to return the most recent start time
for a given job?
dsjob -jobinfo
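
For example (hypothetical project and job names), -jobinfo reports the job status
together with the last run start and end times:
dsjob -jobinfo dwh_proj LoadCustomers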

# You are working on a project that contains a large number of jobs contained in
many folders. You would like to review the jobs created by the former developer
of the project. How can you find these jobs?
Use the Advanced Find feature contained in the Designer interface.

# Which two statements are true for named node pools? DK
Using appropriately named node pools can allow separation of buffering from sorting
disks.
Using appropriately named node pool constraints will limit stages to execute only on
the nodes defined in those node pools.

# Which three methods can be used to import metadata from a Web Services
Description Language (WSDL) document?
Web Service Function Definitions
XML Table Definitions
Web Services WSDL Definitions

# What are two tasks that can create DataStage projects?
Install the DataStage engine.
Add new projects from DataStage Administrator.

# Upon which two conditions does the number of data files created by a Data Set
depend?
The number of processing nodes in the default node pool.
The number of disks in the export or default disk pool connected to each processing
node in the default node pool.

# Which requirement must be met to read from a database in parallel using the
ODBC Enterprise stage?
Specify the partition column property.

# Which statements are true for $APT_DISABLE_COMBINATION?
Disabling generates more processes requiring more system resources and memory.
Globally disables operator combining.

# Which techniques would you use to abort a job in a Transformer stage?
Create a dummy output link with a constraint that tests for the condition to abort on
and set the "Abort After Rows" property to 1.

# The dsrpcd daemon is the means by which processes that represent DataStage
jobs are started. The environment that DataStage processes inherit when they are
started is the same environment as for dsrpcd. ODBC drivers and some plug-ins
require that certain directories are included in the shared library environment
variable setting for dsrpcd.

# You set environment variable $APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE,
PARALLEL=TRUE)' for loading index-organized tables. Which statement is accurate
regarding the resulting effect of this environment variable setting?
Oracle load will fail when executed.

# A job design consists of an input sequential file, a Modify stage, followed by a
Filter stage and an output Sequential File stage. The job is run on an SMP machine
with a configuration file defined with three nodes. No environment variables
were set for the job. How many osh processes will this job create? DK
9

# Using a DB2 for z/OS source database, a 200 million row source table with 30
million distinct values must be aggregated to calculate the average value of two
column attributes. What would provide optimal performance while satisfying the
business requirements?
Select all source rows using a DB2 API stage. Aggregate using a Sort Aggregator.

# In your DB2 database you have column names that use characters # and $.
Which two steps should be completed to allow DataStage to convert these
characters into internal format and back as necessary?
Set environment variable $DS_ENABLE_RESERVED_CHAR_CONVERT to true.
Avoid using the strings __035__ and __036__ in your IBM DB2 column names.
# When invoking a job from a third-party scheduler, it is often desirable to
invoke a job and wait for its completion in order to return the job's completion
status. Which three commands would invoke a job named "BuildWarehouse" in
project DevProject and wait for the job's completion?
dsjob -run -jobstatus DevProject BuildWarehouse
dsjob -run -userstatus DevProject BuildWarehouse
dsjob -run -wait DevProject BuildWarehouse
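
As a sketch of how a scheduler script might use this (the exit-code mapping should be
confirmed against your release), -jobstatus makes dsjob wait and return the job's
finishing status as its exit code:
dsjob -run -jobstatus DevProject BuildWarehouse
echo $?    # typically 1 = ran OK, 2 = ran with warnings, 3 = aborted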

# What is the default Array Size in OCI stage ?
32767

# What is the default cache size of Datastage ?
256MB

# Which tasks are part of managing active sessions in DataStage?
Viewing all active sessions
Setting session limits
Opening user details
Disconnecting a session

# What can you do from the Administrator client?
Set up user permissions for projects
Purge job log file
Set Environment variable default value
Add, delete, and move InfoSphere DataStage projects

# In which two situations would you not use the Web Services Client stage?
You want to deploy a service.
You need to create a WSDL.

# Which two actions can improve sort performance in a DataStage job?
Specify only the key columns which are necessary.
Minimize the number of sorts used within a job flow.
Adjust the "Restrict Memory Usage" option in the Sort stage.

# You have created a parallel job in which there are several stages that you want to
be able to re-use in other jobs. You decided to create a parallel shared container
from these stages. Identify two things that are False about this shared container.
It can be used in sequencer jobs.
It can be used in Transformer stage derivations.

1. What are the main features of DataStage? DK
DataStage has the following features to aid the design and processing required to
build a data warehouse :
A. Uses graphical design tools. With simple point and click techniques
you can draw a scheme to represent your processing requirements.
B. Extracts data from any number or types of database.
C. Handles all the metadata definitions required to define your data
warehouse.
D. You can view and modify the table definitions at any point during
the design of your application.
E. Aggregates data.
F. You can modify SQL SELECT statements used to extract data.
G. Transforms data. DataStage has a set of predefined transforms and
functions you can use to convert your data. You can easily extend the
functionality by defining your own transforms to use.
H. Loads the data warehouse.

2. What are Stage Variables, Derivations and Constants? DK
A. Stage Variable - An intermediate processing variable that retains its value during
a read and does not pass the value into the target column.
Derivation - Expression that specifies the value to be passed on to the target column.
Constant - Conditions that are either true or false that specify the flow of data
within a link.

3. Types of views in DataStage Director? DK
There are 3 types of views in DataStage Director:
a) Job View - Dates of jobs compiled.
b) Status View - Status of a job's last run.
c) Log View - Warning messages, event messages, program-generated
messages.

4. How do you execute a DataStage job from the command line prompt? DK
A) Using the "dsjob" command as follows.
dsjob -run -jobstatus projectname jobname

5. Functionality of Link Partitioner and Link Collector?
A) Link Partitioner: It splits data into various partitions or data flows using
various partition methods.
Link Collector: It collects the data coming from partitions, merges it into a single
data flow and loads it to the target.



6. What are the types of jobs available in datastage?DK
A. Server Job
B. Parallel Job
C. Sequencer Job
D. Container Job

7. What is the difference between Server Jobs and Parallel Jobs?
A. Server jobs do not support partitioning techniques, but parallel jobs support
partitioning techniques.
B. Server jobs do not support SMP/MPP, but parallel jobs support SMP and MPP.
C. Server jobs run on a single node, but parallel jobs run on multiple nodes.
D. Server jobs are preferred when the source data volume is low; when the data
volume is huge, parallel jobs are preferred.


8. What is a project?
DataStage Projects: A complete project may contain several jobs and user-defined
components.
A project contains DataStage jobs, plus:
Built-in components. These are predefined components used in a job.
User-defined components. These are customized components created using the
DataStage Manager. Each user-defined component performs a specific task in a
job.

All work is done within a project. Projects are created during and after the
installation process. You can add projects on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by
DataStage to store jobs and other DataStage objects and metadata.

9. What is a sequencer?
A sequencer lets you graphically create a controlling job without using the job
control function.

10. What is a container?
A group of stages and links in a job design is called a container.
There are two kinds of containers:
Local and Shared.
Local containers only exist within the single job they are used in. Use shared
containers to simplify complex job designs.
Shared containers exist outside of any specific job. They are listed in the Shared
Containers branch in the Manager. These shared containers can be added to any job.
Shared containers are frequently used to share a commonly used set of job
components.
A job container contains two unique stages. The Container Input stage is used to
pass data into the container. The Container Output stage is used to pass data out
of the container.

11. What are the different types of containers?
Local containers. These are created within a job and are only accessible by that
job. A local container is edited in a tabbed page of the job's Diagram window.
Shared containers. These are created separately and are stored in the
Repository in the same way that jobs are. Instances of a shared container can be
inserted into any server job in the project. Shared containers are edited in their
own Diagram window.

12. What are mainframe jobs?
A mainframe job is compiled and run on the mainframe. Data extracted by such
jobs is then loaded into the data warehouse.

13. What are parallel jobs?
These are compiled and run on the DataStage server in a similar way to server
jobs, but support parallel processing on SMP, MPP and cluster systems.

14. How do you use a procedure in a DataStage job?
Use the ODBC plug-in, pass one dummy column and give the procedure name in the
SQL tab.



15. What is the ODBC stage?
A stage that extracts data from or loads data into a database that implements
the industry-standard Open Database Connectivity API. Used to represent a data
source, an aggregation step, or a target data table (server jobs only).

16. What is a hash file? What are its types?
A hashed file is just like an indexed sequential file; the file is internally indexed
with a particular key value. There are two types of hashed file: static hashed
files and dynamic hashed files.

17. What type of hash file is generally used in DataStage jobs?
Static hashed file.

18. What is a stage variable?
In a DataStage Transformer we can define variables that hold intermediate values
derived from the source.

19. What are constraints and derivations?
Constraints are conditions that control which rows flow down a link; derivations are
expressions that specify the values passed to target columns. Both can reference
stage variables.

20. How do you reject records in a transformer? DK
Through a DataStage constraint on an output (reject) link we can reject records.

21. Why do you need stage variables?
That depends upon the job requirement; through stage variables we can filter data
and hold intermediate values across rows.

22. What is the precedence of stage variables,derivations, and constraints?
stage variables =>constraints=> derivations

23. What are data elements?
A specification that describes the type of data in a column and how the data is
converted.

24. What are routines?
In DataStage a routine is just like a function, which we call in a DataStage job.
There are in-built routines and we can also create our own routines.

25. What are transforms and what is the difference between routines and
transforms?
Transforms are used to manipulate data within a DataStage job. A transform
converts a data value using a single expression, whereas a routine is a BASIC
function that can contain more complex logic and can be called from expressions
or before/after subroutines.

26. What is a DataStage macro? DK
In DataStage, macros can be used in expressions, job control routines and
before/after subroutines. The available macros are concerned with ascertaining job
status.

27. What is job control?
A job control routine provides the means of controlling other jobs from the
current job. A set of one or more jobs can be validated, run, reset, stopped and
scheduled in much the same way as the current job can be.
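
A minimal job control sketch in DataStage BASIC, assuming a job named
"LoadCustomers" exists in the project (the job name is hypothetical):
* Attach to the job, run it, wait for completion, then read its status
hJob = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
ErrCode = DSDetachJob(hJob)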

28. How many types of stage? DK
There are three basic types of stage:
Built-in stages: Supplied with DataStage and used for extracting, aggregating,
transforming, or writing data. All types of job have these stages.
Plug-in stages: Additional stages that can be installed in DataStage to perform
specialized tasks that the built-in stages do not support. Server jobs and parallel
jobs can make use of these.
Job Sequence stages: Special built-in stages which allow you to define
sequences of activities to run. Only job sequences have these.

29. Define the difference between active and Passive Stage?
There are two kinds of stages:
Passive stages define read and write access to data sources and repositories.
Sequential
ODBC
Hashed
Active stages define how data is filtered and transformed.
Transformer
Aggregator
Sort plug-in

30. What are the plug-in stages used in your projects?
Plug-in stage: A stage that performs specific processing that is not supported
by the standard server job stages.
Plug-ins used: ORAOCI8, Orabulk.

31. What is a plug-in stage? How do you install plug-in stages?
Installation: Plug-in stages are added to the Diagram window in the same way
as built-in stages. To add a plug-in stage:
A. Do one of the following:
Choose the stage type from the Insert menu.
Click the appropriate stage button on the tool palette.
B. Click in the Diagram window where you want to position the stage. The stage
appears as a square in the Diagram window.
---- Some information about Plug-In Stage
You may find that the built-in stage types do not meet all your requirements for
data extraction or transformation. In this case, you need to use a plug-in stage. The
function and properties of a plug-in stage are determined by the plug-in used when
the stage is inserted. Plug-ins are written to perform specific tasks, for example, to
bulk load data into a data mart.
Two plug-ins are always installed with DataStage: BCPLoad and Orabulk. You can
also choose to install a number of other plug-ins when you install DataStage.
You can write plug-ins. However, contact your local Ascential office before creating
a plug-in, as there are a range of plug-ins already available that may meet your
needs.

30. Difference between ORAOCI8 and Orabulk? DK
ORAOCI8: This stage allows you to connect to an Oracle database.
Orabulk: The Orabulk plug-in generates control and data files for bulk loading
into a single table on an Oracle target database. The files are suitable for loading
into the target database using the Oracle command sqlldr.

32. What is the Sort plug-in?
A mainframe processing stage that sorts input columns.

33. What is the Aggregator stage?
A stage type that computes totals or other functions of sets of data.

34. What is the hashed file stage and the Sequential File stage?
The Hashed File stage extracts data from or loads data into a database that
contains hashed files. The Sequential File stage reads data from or writes data to
a text file.

35. What types of flat files have you used? Have you used tab-delimited files?
Sequential flat files with comma-separated values.

36. What is job control code?
Job control code is used in a job control routine to create a controlling job,
which invokes and runs other jobs.


A. What is DataStage Parallel Extender / Enterprise
Edition (EE)?
Parallel Extender provides parallel processing of data extraction and
transformation applications. There are two types of parallel processing:
1) Pipeline parallelism
2) Partition parallelism

B. What is a conductor node?
Ans -> Every parallel job run has a conductor process, where the execution
starts, a section leader process for each processing node, a player process for
each set of combined operators, and an individual player process for each
uncombined operator.
Whenever we want to kill a job's processes manually, we should destroy the player
processes, then the section leader processes, and then the conductor process.

C. How do you execute a DataStage job from the command
line prompt?
Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname

ex: $ dsjob -run
and also options like:

-stop -To stop the running job
-lprojects - To list the projects
-ljobs - To list the jobs in project
-lstages - To list the stages present in job.
-llinks - To list the links.
-projectinfo - returns the project information(hostname and project name)
-jobinfo - returns the job information(Job status,job runtime,endtime, etc.,)
-stageinfo - returns the stage name ,stage type,input rows etc.,)
-linkinfo - It returns the link information
-lparams - To list the parameters in a job
-paraminfo - returns the parameters info
-log - add a text message to log.
-logsum - To display the log
-logdetail - To display details like event_id, time, message
-lognewest - To display the newest log id.
-report - display a report containing generated time, start time, elapsed
time, status etc.
-jobid - Job id information.
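
A few usage sketches (the project and job names are hypothetical):
dsjob -lprojects
dsjob -ljobs dwh_proj
dsjob -logsum dwh_proj LoadCustomers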


D. Difference between sequential file, dataset and
fileset?
Sequential File:
1. Extract/load from/to a sequential file of max 2GB.
2. When used as a source, at the time of compilation it will be converted into
native format from ASCII.
3. Does not support null values.
4. A sequential file can only be accessed on one node.
Dataset:
1. It preserves partitioning. It stores data on the nodes, so when you read from a
dataset you don't have to repartition the data.
2. It stores data in binary in the internal format of DataStage, so it takes less
time to read/write from a dataset to any other source/target.
3. You cannot view the data without DataStage.
4. It creates 2 types of files for storing the data:
A) Descriptor file: Created in the defined folder/path.
B) Data file: Created in the dataset folder mentioned in the configuration file.
5. A dataset (.ds) file cannot be opened directly; instead use the Data Set
Management utility in the client tools (such as Designer and Manager) or the
orchadmin command line utility.
Fileset:
1. It stores data in a format similar to that of a sequential file. The only advantage
of using a fileset over a sequential file is that it preserves the partitioning scheme.
2. You can view the data, but in the order defined in the partitioning scheme.
3. A fileset creates a .fs file stored in ASCII format, so you can open it directly to
see the paths of the data files and the schema.


# What is the main difference between the Lookup,
Join and Merge stages?
All are used to join tables, but note the differences.

Lookup: when the reference data is very small we use lookup, because the data is
stored in memory. If the reference data is very large then it will take time to load
and to perform the lookup.

Join: if the reference data is very large then we go for join, because it accesses the
data directly from the disk, so the processing time will be less when compared to
lookup. But in join we cannot capture the rejected data, so we go for merge.

Merge: if we want to capture rejected data (when the join key is not matched)
we use the Merge stage. For every detail link there is a reject link to capture
rejected data.

Significant differences that I have noticed are:
1) Number of reject links
(Join) does not support a reject link.
(Merge) has as many reject links as there are update links (if there are n input links
then 1 will be the master link and n-1 will be update links).
2) Data selection
(Join) There are various ways in which data is selected, e.g. we have
different types of joins: inner, outer (left, right, full), cross join, etc. So you have
different selection criteria for dropping/selecting a row.
(Merge) Data in the master record and update records are merged only when both
have the same value for the merge key columns.

# What are the different types of lookup? When should one use a sparse lookup in a job?
In DS 7.5 there are 2 types of lookup options available: 1. Normal 2. Sparse
From DS 8.0.1 onwards, there are 3 types of lookup options available: 1.
Normal 2. Sparse 3. Range
Normal lookup: the reference data is loaded into memory first and then the
lookup is performed, due to which it takes more execution time if the reference
data is high in volume. A normal lookup takes the entire table into memory and
performs the lookup there.
Sparse lookup: the SQL query is fired directly on the database for each input record,
due to which execution is faster than a normal lookup when the reference table is
very large; a sparse lookup performs the lookup directly at the database level.
i.e. the reference link is directly connected to a DB2/OCI stage, firing a query on
the DB table row by row to fetch the result.

Range lookup: this helps you to search records based on a particular range. It
will search only that particular range of records and provides good performance
instead of searching the entire record set.

i.e Define the range expression by selecting the upper bound and lower bound
range columns and the required operators.
For example:
Account_Detail.Trans_Date >= Customer_Detail.Start_Date AND
Account_Detail.Trans_Date <= Customer_Detail.End_Date


# What are the use and types of the Funnel stage in DataStage?
The Funnel stage is a processing stage. It copies multiple input data sets to a
single output data set. This operation is useful for combining separate data sets
into a single large data set. The stage can have any number of input links and a
single output link.
The Funnel stage can operate in one of three modes:
Continuous Funnel combines the records of the input data in no
guaranteed order. It takes one record from each input link in turn. If data is not
available on an input link, the stage skips to the next link rather than waiting.
Sort Funnel combines the input records in the order defined by the
value(s) of one or more key columns and the order of the output records is
determined by these sorting keys.
Sequence copies all records from the first input data set to the output
data set, then all the records from the second input data set, and so on.
For all methods the metadata of all input data sets must be identical. Column
names should be the same on all input links.


# What is the difference between link sort and the Sort
stage?
Or: difference between link sort and stage sort?

If the volume of the data is low, then we go for link sort.
If the volume of the data is high, then we go for the Sort stage.
"Link sort" uses scratch disk (a physical location on disk), whereas the
"Sort stage" uses server RAM (memory). Hence we can change the default memory
size in the Sort stage.

Using the Sort stage you have the possibility to create a KeyChange column - not
possible in link sort.
Within the Sort stage you have the possibility to increase the memory size per
partition.
Within the Sort stage you can define the 'don't sort' option on sort keys that are
already sorted.

Link sort and stage sort both do the same thing. Only the Sort stage provides
more options, like the amount of memory to be used, removing duplicates, sorting
in ascending or descending order, creating key change columns, etc. These
options are not available while using link sort.


# What is the main difference between the Change Capture
and Change Apply stages?

Change Capture stage: compares two data sets (after and before) and makes a
record of the differences.
Change Apply stage: combines the changes from the Change Capture stage with
the original before data set to reproduce the after data set.

The Change Capture stage catches hold of the changes between two different
datasets and generates a new column called change code. The change code has values:
0 - copy
1 - insert
2 - delete
3 - edit/update

The Change Apply stage applies these changes back to those data sets based on the
change code column.


# Difference between the Transformer and BASIC
Transformer stages?
What is the basic difference between the Transformer and BASIC Transformer
stages in parallel jobs?

The BASIC Transformer can be used in server jobs and parallel jobs, but:
It supports one input link, 'n' number of output links, and only one reject link.
The BASIC Transformer operates in sequential mode.
All functions, macros and routines are written using the BASIC language.

Parallel Transformer stage:
Can have one primary input link, multiple reference input links, and multiple
output links.
The link from the main data input source is designated as the primary input link.
In the PX Transformer all functions and macros are written in the C++ language.
It supports partitioning of data.

# Details of data partitioning and collecting
methods in DataStage?
The partitioning mechanism divides data into smaller segments, which are then
processed independently by each node in parallel. It helps take advantage of parallel
architectures like SMP, MPP, grid computing and clusters.

1. Auto
2. Same
3. Round robin
4. Hash
5. Entire
6. Random
7. Range
8. Modulus

Collecting is the opposite of partitioning and can be defined as a process of bringing
back data partitions into a single sequential stream (one data partition).
1. Auto
2. Round Robin
3. Ordered
4. Sort Merge
** DATA PARTITIONING METHODS : DATASTAGE SUPPORTS A FEW TYPES OF
DATA PARTITIONING METHODS WHICH CAN BE IMPLEMENTED IN PARALLEL
STAGES:
Auto - default. Datastage Enterprise Edition decides between using Same or Round
Robin partitioning. Typically Same partitioning is used between two parallel stages and
round robin is used between a sequential and an EE stage.

Same - existing partitioning remains unchanged. No data is moved between nodes.

Round robin - rows are alternated evenly across partitions. This partitioning method
guarantees an exact load balance (the same number of rows processed) between nodes
and is very fast.

Hash - rows with same key column (or multiple columns) go to the same partition.
Hash is very often used and sometimes improves performance, however it is important
to have in mind that hash partitioning does not guarantee load balance and misuse may
lead to skew data and poor performance.

Entire - all rows from a dataset are distributed to each partition. Duplicated rows are
stored and the data volume is significantly increased.

Random - rows are randomly distributed across partitions.

Range - an expensive refinement to hash partitioning. It is similar to hash but partition
mapping is user-determined and partitions are ordered. Rows are distributed according
to the values in one or more key fields, using a range map (the 'Write Range Map' stage
needs to be used to create it). Range partitioning requires processing the data twice
which makes it hard to find a reason for using it.

Modulus - data is partitioned on one specified numeric field by calculating modulus
against number of partitions. Not used very often.

** DATA COLLECTING METHODS : A COLLECTOR COMBINES PARTITIONS INTO A
SINGLE SEQUENTIAL STREAM. DATASTAGE PARALLEL SUPPORTS THE FOLLOWING
COLLECTING ALGORITHMS:
Auto - the default algorithm reads rows from a partition as soon as they are ready.
This may lead to producing different row orders in different runs with identical data.
The execution is non-deterministic.

Round Robin - picks rows from the input partitions in turn, for instance: first row from
partition 0, next from partition 1, even if other partitions can produce rows faster than
partition 1.

Ordered - reads all rows from first partition, then second partition, then third and so
on.

Sort Merge - produces a globally sorted sequential stream from within-partition sorted
rows. Sort Merge produces a sequential stream (non-deterministic on un-keyed columns)
using the following algorithm: always pick the partition that produces the row
with the smallest key value.



# Remove duplicates using the Sort stage and the Remove
Duplicates stage, and the difference?
We can remove duplicates using both stages, but in the Sort stage we can capture
duplicate records using the create key change column property.

1) The advantage of using the Sort stage over the Remove Duplicates stage is that the
Sort stage allows us to capture the duplicate records, whereas the Remove Duplicates
stage does not.
2) Using a Sort stage we can only retain the first record. Normally we retain the last
record by sorting a particular field in ascending order and taking the last record; the
same result can be achieved with the Sort stage by sorting in descending order and
retaining the first record.


# What is the difference between the Copy and Transformer
stages?
In a Copy stage there are no constraints or derivations, so it should perform better
than a Transformer. If you want a copy of a dataset you should use the Copy stage,
and if there are any business rules to be applied to the dataset you should use the
Transformer stage.

We use the Copy stage to change the metadata of the input dataset (like changing a
column name).

# What is the use of the Pivot Enterprise stage?
The Pivot Enterprise stage is a processing stage that pivots data horizontally and
vertically.
Specifying a horizontal pivot operation: Use the Pivot Enterprise stage to
horizontally pivot data to map sets of input columns onto single output columns.

Table 1. Input data for a simple horizontal pivot operation
REPID  last_name  Jan_sales  Feb_sales  Mar_sales
100    Smith      1234.08    1456.80    1578.00
101    Yamada     1245.20    1765.00    1934.22

Table 2. Output data for a simple horizontal pivot operation
REPID  last_name  Q1sales  Pivot_index
100    Smith      1234.08  0
100    Smith      1456.80  1
100    Smith      1578.00  2
101    Yamada     1245.20  0
101    Yamada     1765.00  1
101    Yamada     1934.22  2


Specifying a vertical pivot operation: Use the Pivot Enterprise stage to vertically pivot
data and then map the resulting columns onto the output columns.

Table 1. Input data for a vertical pivot operation
REPID  last_name  Q_sales
100    Smith      1234.08
100    Smith      1456.80
100    Smith      1578.00
101    Yamada     1245.20
101    Yamada     1765.00
101    Yamada     1934.22

Table 2. Output data for a vertical pivot operation
REPID  last_name  Q_sales (January)  Q_sales1 (February)  Q_sales2 (March)  Q_sales_average
100    Smith      1234.08            1456.80              1578.00           1412.96
101    Yamada     1245.20            1765.00              1934.22           1648.14


# What are Stage Variables, Derivations and
Constants?
Stage Variable - An intermediate processing variable that retains its value during a read
and does not pass the value into the target column.
Derivation - Expression that specifies the value to be passed on to the target column.
Constraint - A filter condition which limits the number of records coming from the
input according to a business rule.
The right order of evaluation is: stage variables, then constraints, then derivations.


# What is the difference between the Change Capture
and Change Apply stages?

The Change Capture stage is used to get the difference between two sources, i.e. the
after dataset and the before dataset. The source which is used as a reference to capture
the changes is called the after dataset. The source in which we are looking for the
change is called the before dataset. Change Capture adds one field called "change code"
to its output. From this change code one can recognize which kind of change it is,
i.e. whether it is a delete, insert or update.

The Change Apply stage is used along with the Change Capture stage. It takes the
change code from the Change Capture stage and applies all the changes to the before
dataset based on the change code.

Change Capture is used to capture the changes between the two sources.

Change Apply will apply those changes in the output file.
Sample Certification Dump

QUESTION NO: 1
The derivation for a stage variable is: Upcase(input_column1) : ' ' :
Upcase(input_column2).
Suppose that input_column1 contains a NULL value. Which behavior is expected?
A. The job aborts.
B. NULL is written to the target stage variable.
C. The input row is either dropped or rejected depending on whether the
Transformer has a reject link.
D. The target stage variable is populated with spaces or zeros depending on the
stage variable
data type.
Answer: B
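
A related sketch (not part of the dump answer; the link and column names are
hypothetical): NULL propagation can be avoided in the derivation itself with a
null-handling function, for example:
Upcase(NullToEmpty(lnk_in.input_column1)) : ' ' : Upcase(NullToEmpty(lnk_in.input_column2))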


QUESTION NO: 2
You are processing groups of rows in a Transformer. The first row in each group
contains "1" in
the Flag column and "0" in the remaining rows of the group. At the end of each group
you want to sum and output the QTY column values. Which three techniques will
enable you to retrieve the sum of the last group? (Choose three.)
A. Output the sum that you generated each time you process a row for which the
LastRow() function returns True.
B. Output the sum that you generated up to the previous row each time you process a
row with a "1" in the Flag column.
C. Within each group sort the Flag column in ascending order. Output the sum each
time you
process the row with a "1" in the Flag column.
D. Output a running total for each group for each row. Follow the Transformer stage
by an
Aggregator stage. Take the MAX of the QTY column for each group.
E. Output the sum that you generated up to the previous row each time you process
a row with a "1" in the Flag column. Use the LastRow() function to determine when
the last group is done.
Answer: C,D,E


QUESTION NO: 3
Records in a source file must be copied to multiple output streams for further
processing. Which
two conditions would require the use of a Transformer stage instead of a Copy stage?
(Choose
two.)
A. Renaming one or more output columns.
B. Concatenating data from multiple input columns.
C. Converting some input columns from integers to strings.
D. Directing selected output records down one output link rather than another.
Answer: B,D


QUESTION NO: 4
A job needs to split a single Data Set into three Data Sets based on conditions that
are supplied at runtime. Which stage would allow you to parameterize the conditions
for splitting the input data set?
A. Filter stage
B. Switch stage
C. Transformer stage
D. Split Vector stage
Answer: A


QUESTION NO: 5
Which two tasks can the Slowly Changing Dimensions (SCD) stage perform? (Choose
two.)
A. Look up whether a record with a matching business key value exists in a
dimension table. If it
does, add new values for selected fields to values lists for those fields.
B. Look up whether a record with a matching business key value exists in a fact
table. If it does
not, retrieve a new surrogate key value and insert a new row into the fact table.
C. Look up whether a record with a matching business key value exists in a
dimension table. If it
does not, retrieve a new surrogate key value and insert a new row into the
dimension table.
D. Look up whether a record with a matching business key value exists in a
dimension table. If it
does, mark the record as not-current, and generate a new record with new values
for selected
fields.

Answer: C,D
