
Ab Initio Interview Questions

1. What is Override Key?


A. If you set the override-key parameter for a particular port, then the input on that
port must be sorted according to the override-key parameter for that port.
In the Join component, if the two input tables use different field names for the
join key, we specify the key from one table as the main key and the corresponding
key from the other table as an override key.

1A. What is surrogate key?


A. A system-generated key used in place of the natural key. We can create surrogate
keys in slowly changing dimensions; this is Type 2 functionality.
Ex: To maintain both historical and current information.

2. What is Maxcore?
A. Max-core is the maximum amount of memory a component may use at run time.
Default values: Sort 10 MB, Join 64 MB.

2A. Explain what max-core values are used for and what you need to look out for
when setting them.
A. Max-core is the maximum amount of memory, in bytes, that a component will use.
Too low: the component spills to disk, which slows down the application.
Too high: the component consumes too many resources, which also slows things down.

2B. Which components have a max-core parameter?


A. Join, Scan, Sort, Rollup

2C. When a component spills to disk, where is the data stored by default?
A. The \Temp directory.

2D. With an in-memory join, what needs to fit into max-core?
A. All non-driving inputs, plus the overhead of the hash table.
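The idea behind answer 2D can be sketched in Python. This is only an illustration of a generic hash join, not Ab Initio code: the function name `hash_join` and the sample records are made up for the example. The non-driving input is loaded into a hash table (the part that would have to fit in max-core), while the driving input is read one record at a time.

```python
from collections import defaultdict


def hash_join(driving, non_driving, key):
    """Inner join: hash the non-driving input, stream the driving input."""
    table = defaultdict(list)
    for rec in non_driving:
        # The whole non-driving input sits in this table, so it must fit in memory.
        table[rec[key]].append(rec)
    for rec in driving:
        # The driving input is never held in memory as a whole.
        for match in table.get(rec[key], []):
            yield {**match, **rec}


orders = [
    {"cust_id": 1, "amount": 20},
    {"cust_id": 2, "amount": 35},
    {"cust_id": 9, "amount": 5},
]
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
joined = list(hash_join(orders, customers, "cust_id"))
```

Choosing the larger input as the driving one keeps the hash table small, which is why the size of the non-driving inputs is what matters for max-core.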

2E. With an in-memory sort, what needs to fit into max-core?
A. Three times the size of the input data; the recommended maximum is 100 MB.

3. What is Outer join?


A. Outer join - sets the record-required parameters for all ports to false.

4. Which partition components are not key-based?


A. Broadcast, Partition By Load Balance, Partition By Round Robin, Partition By
Percentage, Partition By Expression.

5. Which departition components are not key-based?
A. Concatenate, Gather.

5A. What is the use of the partition components in graphs?
A. Data parallelism.

6. Using the Concatenate component can cause a deadlock. True or false?
A. True. Automatic flow buffering is used to avoid the deadlock.

7. How do you change a field name without changing its value?
A. Use the Redefine Format component.

8. What is Condition component?


A. Reusable code

9. What is check order component?


A. To check the order of the data

10A. How can you copy a multifile?
A. m_cp

10B. Which multifile commands do you know?
A. m_mkdir, m_mkfs, m_cp, m_expand, m_mv, m_eval

10C. How do you kill a Running Ab Initio process?


A. m_kill process_id

11. What command is used to evaluate DML?


A. m_eval, m_dump

12. What command is used to create a multifile system?


A. m_mkfs. Syntax: m_mkfs ctl file1 file2 file3 file4.

13. Which do you use first: a Sort or a Partition component?
A. Partition first; Sort is applied after partitioning is done.

14. What information exists in Control file?


A. Partition file reference. Control file has the paths where the multifiles are located.

15. Difference between phases and checkpoints?


A. Phase: a stage of a graph that runs to completion before the start of the next
stage. By dividing a graph into phases, you can save resources, avoid deadlocks, and
safeguard against failures. For example, Phase 2 runs only after Phase 1 has
completed successfully.
Checkpoint: a phase that acts as an intermediate stopping point in a graph to
safeguard against failures. By assigning checkpoints to phases, you can recover the
completed stages of the graph if a failure occurs, and restart the process from the
last checkpoint instead of from the beginning.
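The restart behaviour of checkpoints can be sketched in a few lines of Python. This is a toy model, not how the Co>Operating System implements recovery: the `state` dictionary is a hypothetical stand-in for the saved recovery information, and `run_graph`, `phase1`, and `phase2` are names invented for the example.

```python
def run_graph(phases, state):
    """Run phases in order, recording each completed phase in `state`."""
    for index, phase in enumerate(phases):
        if index < state.get("completed", 0):
            continue  # this phase finished before the failure, so skip it on restart
        phase()
        state["completed"] = index + 1  # checkpoint once the phase has succeeded


log = []
attempts = {"n": 0}


def phase1():
    log.append("phase1")


def phase2():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated failure")
    log.append("phase2")


state = {}
try:
    run_graph([phase1, phase2], state)  # first run fails inside phase2
except RuntimeError:
    pass
run_graph([phase1, phase2], state)  # restart: phase1 is not run again
```

After the restart, `phase1` has run exactly once, which is the saving a checkpoint is meant to provide.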

15A. Explain the concept of phases in Ab Initio and what they can be used for.
A. The first phase must complete before the second runs; phases can reduce
concurrent resource usage; status is saved after each phase that is a checkpoint.

16. What directory is used at the time of running a graph?


A. Ab Initio work directory

17. Difference between Scan and Rollup?


A. Scan: produces an output record for every input record, giving cumulative
(running) totals within each group defined by the key specifier.
Rollup: produces a single output record for each group of input records defined by
the key specifier.
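The difference is easiest to see on a small data set. The Python sketch below imitates the two behaviours on input that is already sorted by key; `rollup`, `scan`, and the `sales` records are illustrative names, not Ab Initio APIs.

```python
from itertools import groupby
from operator import itemgetter


def rollup(records, key, field):
    """One summary record per key group."""
    for k, group in groupby(records, key=itemgetter(key)):
        yield {key: k, "total": sum(r[field] for r in group)}


def scan(records, key, field):
    """One record per input record, carrying the running total of its group."""
    for k, group in groupby(records, key=itemgetter(key)):
        running = 0
        for r in group:
            running += r[field]
            yield {key: k, "running_total": running}


# Input must already be grouped by key, as groupby only merges adjacent records.
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 7},
]
rollup_out = list(rollup(sales, "region", "amount"))
scan_out = list(scan(sales, "region", "amount"))
```

Here `rollup_out` has two records (one per region) while `scan_out` has three (one per input record).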

18. Compound Data types?


A. Vector, Record Types, Union

19. What is Partition by key and sort components?


A. Repartition.

20. What is the difference between Partition by Round-robin and Partition with
Load Balance?
A. Partition by Round-robin: Distributes data records evenly to each output flow in
round-robin fashion.
Partition with Load Balance: Distributes data records to output flow partitions,
writing more records to the flow partitions that consume records faster.
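As a rough illustration of how a non-key-based scheme differs from a key-based one, the Python sketch below contrasts round-robin distribution with partitioning by a hash of the key. Load balancing depends on how fast consumers read, so it is not modelled here. The function names and the use of CRC32 as the hash are assumptions made for the example only.

```python
import zlib


def partition_round_robin(records, n):
    """Deal records out to n partitions in turn, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def partition_by_key(records, n, key):
    """Send every record with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        h = zlib.crc32(str(rec[key]).encode("utf-8"))
        parts[h % n].append(rec)
    return parts


rr_parts = partition_round_robin(list(range(10)), 4)
keyed = [{"k": k} for k in ["a", "b", "a", "c", "b", "a"]]
key_parts = partition_by_key(keyed, 3, "k")
```

Round-robin gives evenly sized partitions but scatters equal keys; key-based partitioning keeps equal keys together, at the risk of skew if some keys are much more frequent than others.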

21. What component is generally used for expanding processing from serial to parallel?
A. A partition component.

22. If you want to do a join in parallel, which partition component would you use?
A. Partition by Key.

23. What is the layout tab of a component used for in Ab Initio?
A. It specifies where the component runs and how many ways parallel it runs.

24. If you had an ad hoc multifile of 100 files, and you wanted to run only 4 ways
parallel, what would you do?
A. Use Concatenate or a custom component.

25. Within a graph, how would you take a 4-way parallel stream to 8 ways? What
component would you use?
A. Repartition using partition -> gather, or a partition -> fan-in component.

25A. If we create a multifile, what files are created?
A. 1. Control file.
2. Data files (serial partition files).

26. What are include files used for?
A. They allow functions and named types to be shared across multiple transforms.

27. What are some things to look for when tuning a graph for performance?
A. Skew, unnecessary sorts, max-core settings, degree of parallelism, etc.

28. Explain skew and how it affects an application's performance.
A. Skew is an uneven spread of data among partitions. When skew is bad, the heavily
loaded partitions become a bottleneck.

29. What is a lookup file?
A. A lookup file is typically a single (serial) file holding a small amount of data.
It is used to look up records by key; it is not used for sorting.

29A. What is lookup_local?
A. It is used when the lookup file is a multifile. It searches only the partition of
the lookup file that is local to the current partition.

29B. Explain the difference between lookup and lookup_local.
A. With lookup_local, when the lookup file is a multifile, only the local partition
is examined. The input data must be partitioned on the same key as the lookup file.
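The reason the input must be partitioned on the lookup key can be shown with a small Python model. Everything here is hypothetical scaffolding (`partition_of`, `build_partitioned_lookup`, and a `lookup` that simply searches all partitions); it is not the DML function set, only a sketch of the local-versus-global idea.

```python
import zlib


def partition_of(key_value, n):
    """Assumed partitioning rule: hash of the key modulo the partition count."""
    return zlib.crc32(str(key_value).encode("utf-8")) % n


def build_partitioned_lookup(rows, key, n):
    """Split lookup rows into n partitions using the same rule as the data."""
    parts = [{} for _ in range(n)]
    for row in rows:
        parts[partition_of(row[key], n)][row[key]] = row
    return parts


def lookup(parts, key_value):
    """Global search: examine every partition."""
    for part in parts:
        if key_value in part:
            return part[key_value]
    return None


def lookup_local(parts, local_partition, key_value):
    """Local search: examine only the caller's own partition."""
    return parts[local_partition].get(key_value)


countries = [
    {"code": "US", "name": "United States"},
    {"code": "FR", "name": "France"},
]
parts = build_partitioned_lookup(countries, "code", 4)
home = partition_of("US", 4)
```

A record keyed "US" finds its match with `lookup_local` only if it is processed in partition `home`; in any other partition the local search returns nothing, which is why data and lookup file must share the partitioning key.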

30. Which files store the database configuration parameters?
A. dbconfig (.dbc) files.

31. If a job fails, how do you roll back to the last successful checkpoint manually?
A. m_rollback

38. Explain the difference between local and formal parameters.
A. Local parameters are static; formal parameters are supplied dynamically at run time.

39. Where can we store temporary files?


A. Temp directory.

42. Compress is not working in Windows Environment? (True/False)


A. False.

43. Which versions of Ab Initio have you used?
A. GDE: 1.13.4, 1.12.9, 1.12.5, 1.11, 1.10.11
Co>Operating System: 2.12, 2.11, 2.10.11

44. What was the first component you used in your graph?
A. The Input Table component or Reformat, depending on the graph.

45. How many graphs did you develop in your last project?
A. 15

46. Tell me the functionality of your graph?


A.

47. What are the sources of data?


A. Oracle, Sybase, DB2, Flat files etc...

48. What problems did you face while creating graphs?
A.

49. Were you mainly involved in back-end or front-end work?


A. Back-end.

50. What are layouts?


A. A layout is a list of hosts and directories. Layouts contain the URLs of multifiles.

51. Suppose you have 50,000 records in the input table and want to test with only
10 records. How can you do it?
A. Use Filter by Expression with the expression next_in_sequence() <= 10.
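The off-by-one risk in this kind of filter is worth a quick check. The Python sketch below assumes, as an illustration, a sequence counter whose first value is 1; under that assumption a limit of ten records needs a `<=` comparison, since `<` would pass only nine. `filter_first` is a made-up helper name.

```python
from itertools import count


def filter_first(records, limit):
    """Keep records while a counter that starts at 1 is within the limit."""
    sequence = count(1)  # assumption: numbering starts at 1, not 0
    return [rec for rec in records if next(sequence) <= limit]


records = list(range(50000))
sample = filter_first(records, 10)
```

Note that a filter like this still reads the whole input; it only limits what flows downstream.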

52. How to join two tables?


A. Using Join component with common column name.

53. How many tables can we join at a time?


A. 18 (or) 20 tables.

54. Tell me about parallelism?


A. 1. Component parallelism
2. Data parallelism
3. Pipeline parallelism.

55. You are handed a process written in Ab Initio, and users are complaining that it
runs slowly. Outline a strategy for improving the performance.
A. Check the following:
1. Parallelism.
2. Sorting in memory.
3. Spilling to disk.
4. Carrying around unnecessary columns.

56. What are the folders available in a sandbox?


A. (.DML),(.XFR),(.DB),(.MPC),(.MP),(.DAT),(.MDC),(.DBC),(.CFG),(.MFS)

56A. Explain the concept of a sandbox and what it is used for?


A. A sandbox groups associated graphs and files in a single directory where the user
works, and lets them share parameters.

56B. What is the difference between sandbox parameters and graph parameters?
A. Sandbox parameters are global and can be accessed from any graph for a particular
user.
Graph parameters are local to the graph and cannot be accessed from other graphs.

57. How many processors did you use in your last project?


A. 6

58. How can a Rollup replace a Sort and Dedup, and when can it do so?
A. Rollup implicitly does a unique sort. If you care which of the duplicates is kept,
you probably cannot use a Rollup to replace a Sort and Dedup.

59. When you run an Ab Initio graph, when does the .rec file get deleted?
A. It is deleted after the graph runs but before the end script runs.

60. Does a join of two sorted data streams preserve their respective sort order?
A. Sometimes.
If the flows are already sorted and are sorted on the same key the join retains the
sort automatically.
If the flows are not pre-sorted you have the choice to maintain the sort order or
not.

61. What is the difference between the Sort and Sort within Groups components? Is
the output the same? Is the performance the same?
A. Sort: sorts the records on the key (ascending by default).
Sort within Groups: sorts records within each group on the minor key, assuming the
input is already grouped on the major key.
In general the two components do different jobs. When the data is already sorted on
the major key, the result is the same as a full Sort on both keys, but Sort within
Groups is quicker because it only sub-sorts data that is already sorted.

62. When deciding upon a partitioning key what reflects a wise choice?
A. Even or nearly even data distribution among partitions denotes a good partitioning
key.

64. Have you ever used the repository?


A.

65. Multifile Unix commands?


A. m_ls -l: lists multifiles.
m_dump: prints the data in a file as interpreted by its record format.
m_expand: lists the locations of the partitions.

66. What is EME? Why is it used?


A. EME: Enterprise Meta Environment. EME is a high-performance object-oriented
storage system that inventories and manages various kinds of information
associated with Ab Initio applications. It provides storage for all aspects of a
data processing system, from design information to operations data.

67. We can run graphs from the GDE. How can I run them without the GDE?
A. By deploying the graph as a Korn shell script.

68. Can you execute the graph more than once at the same time? How?
A. Yes. By setting .rec file.

69. How do you aggregate summary records?


A. By using the Rollup component. Depending on the size of the data, we use sorted
or unsorted input.

70. Which do you use first, Sort or Rollup?


A. We use Sort first and then Rollup.

71. Explain about your last project and its environment?


A.

72. Did you ever use multiway processing? What is the advantage of a parallel MFS
over serial files?
A. Parallel processing. The data is divided into partitions.

73. Explain about Normalize component?


A. Normalize generates multiple output records from each input record. Normalize
can separate a data record with a vector field into several individual data
records, each containing one element of the vector.
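The vector case described above can be imitated in Python, with a list standing in for the vector field. The function `normalize` and the household records are illustrative only.

```python
def normalize(records, vector_field, element_field):
    """Emit one output record per element of each record's vector field."""
    for rec in records:
        base = {k: v for k, v in rec.items() if k != vector_field}
        for element in rec[vector_field]:
            yield {**base, element_field: element}


households = [
    {"address": "12 Elm St", "people": ["Ann", "Bob"]},
    {"address": "3 Oak Ave", "people": ["Cy"]},
]
rows = list(normalize(households, "people", "person"))
```

Two input records with three vector elements in total become three output records, each repeating the non-vector fields.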

74. Co-Op is installed on two servers A and B, graph is running on A, How can I
rename the graph on server B?
A. By using Run program. Issue mv command.

75. How do you communicate between two servers?


A. By Node name and SSH keys.

76. What is .rc in Ab Initio? What does it contain?


A. Recovery. It contains the recovered data.

77. After running the graph in GDE, what file is created in sand box?
A. .rec file.

78. In a Reformat component, how do you set the parameters if I have 1 input and
2 output files?
A.

79. What are the databases you used most of the time?
A. Oracle, Teradata.

80. What component do you use to load data into Oracle database?
A. Output table component.

81. What is the main parameter in Output Table component?


A. Commit table parameter.

82. What is the advantage of using commit?


A. For data recovery, you can rollback to the previous commit.

82A. What is an inner join and what is an outer join?
A. An inner join returns only the matching records. An outer join returns the
matching records plus the non-matching records, with NULL in the fields for which no
value exists.
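A small Python sketch makes the difference concrete. It assumes unique keys on the right-hand input and uses `None` in place of NULL; the `join` function and the sample rows are invented for the illustration.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is "inner" or "outer"."""
    right_by_key = {r[key]: r for r in right}  # assumes unique keys on the right
    left_keys = {l[key] for l in left}
    left_fields = {f for l in left for f in l if f != key}
    right_fields = {f for r in right for f in r if f != key}
    out = []
    for l in left:
        match = right_by_key.get(l[key])
        if match is not None:
            out.append({**l, **match})
        elif how == "outer":
            # Unmatched left record: fill the right-hand fields with None.
            out.append({**l, **{f: None for f in right_fields}})
    if how == "outer":
        for r in right:
            if r[key] not in left_keys:
                # Unmatched right record: fill the left-hand fields with None.
                out.append({**{f: None for f in left_fields}, **r})
    return out


people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
cities = [{"id": 2, "city": "Oslo"}, {"id": 3, "city": "Rome"}]
inner = join(people, cities, "id", how="inner")
outer = join(people, cities, "id", how="outer")
```

Only id 2 appears in both inputs, so the inner join has one record while the outer join has three.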

83. What type of loader you generally use?


A. SQL*Loader.

84. Which do you prefer, Join or a lookup file, if I have two inputs, one with
100,000 records and one with 5,000 records?
A. When one input has few records, as the 5,000-record input does here, use a lookup
file.

85. How do you access your system remotely?
A. ftp, rlogin, scp. ftp transfers files to and from a remote network site; rlogin is
for remote login within the network.

86. Difference between graph and component?


A. A graph

87. What is Abinitio standard environment?


A.

88. What are is_valid, is_error, is_digit, is_defined, string_* functions?


A.

89. Do you have understanding of multifile, can you have them in windows env?
A.

90. There are two datasets, A with 100 million records and B with 50,000 records,
and the data is not sorted. You have to join them. What component would you use?
How can you modify it?
A.

91. In a data stream there is a field with values from 1 to 9. What component will
you use?
A.

92. How will you optimize SQL code in Ab Initio?


A.

93. Abnitio vector to be?


A.

94. Diff. between Scan & Rollup?


A.

95. Explain some string function?


A. lpad.

96. How to Increase Performance?


A.

97. What is the difference between hash Partitioning and Time series partitioning?
A.

98. How do we execute the work scripts from the Ab Initio tool?


A.

99. What is mvs?


A.

100. How do you design the objects using Ab Initio?


A.

101. What are the type of data files we have loaded and how we loaded it?
A.

102. Types of partitioning and how do we do partitioning?


A.

103. How to schedule the job load and how much time it will take to load a 10gb
data file and what type of parallelism?
A.

104. What is start and end script?


A.

105. What is run?


A.

106. How do you Recover files manually?


A.

107. You can join two tables using Join key word in SQL?
A.

108. Have you written any packages? Where do you write them?
A. Yes. We write user-defined functions in the package editor. For example, we wrote
error-handling functions in the package editor and included them using ~ (tilde)
followed by the package name.

109. What is Conditional DML?


A.

110. What are Multistage Components?


A. Rollup, Scan, Normalize, Denormalize, Reformat.

111. What are Multistages?


A. There are 5 Stages
Input select
Initialization
Transformation
Finalization
Output selection

112. What is parallelism in Ab initio?


A.

113. Advantage of Ab initio?


A.

114. How to create a Multifile?


A. m_touch.

115. How do you create a multifile system?
A. m_mkfs (see question 12).

116. How do you get environment variables in Ab Initio?
A. m_env

117. How do you execute SQL statements in Ab Initio?


A. Run Execute component.

118. How do you execute UNIX commands?


A. Run Program Component.

119. Explain the partition, departition, normalize, and denormalize components.
A.

120. What is the use of lookup file?


A.

121. Why we use intermediate files?


A.

122. What is dedup?


A.

123. What is Transformation editor?


A. Statement, Variable, Business Rules.

124. Which component discards records?


A. Trash component.

125. Give some examples of a data flow?


A. Fan-flow, parallel-flow, all-to-all flow, multiplex flow.

126. Which of these is an example of a partition component: Partition by field,
broadcast, partition by division, replicate?
A. Partition by field, broadcast.

127. What parameter specifies the memory size for the sort component?
A. Maxcore.

128. Which component does not order records by flow or key?


A. Gather, concatenate.

129. What is the method to create user-defined functions that the validate
component can use to verify data?
A. is_valid function prefix syntax.

130. Within an include statement, what does the ~ (tilde) character do? Does it
indicate that the given include file is relative to the local sandbox xfr directory?
A. Yes.

131. In a component MPC file, what does the image line specify?
a) The location of the script or program to execute.
b) The label of the component when displayed in a GDE graph.
c) The icon used when displayed in the GDE component library.
d) The argument list passed to the unitool launcher.
e) None of the above.
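A.

132. How do you describe the characteristic of the driving input for the join
component?
A. The driving input is the one that is streamed through the join; it is the records
of the other (non-driving) inputs that are held in memory before the join transform
executes.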

133. Which action will cause the current partitioning keys to become invalid?
a) Multiplying partition keys by a constant
b) Joining using non grouped input with fewer keys than Partition by.
c) Using rollup with grouped input with fewer keys than Partition by.
d) Gathering(2) 8way multi-files into a single 8way multi-file.
e) All the above.
A.

134. What environment variable can be modified to alter the format of monitoring
reports?
a) XX_REPORT
b) XX_DEBUG
c) AB_CONFIG
d) XX_MONITOR
e) None of the above.
A. None of the above

135. What is the component that does not force a phase break?
A. Intermediate file.

136. Which of the following is not a valid DECIMAL format?


a) Left blank padded number
b) Left zero padded number.
c) Number with implicit decimal point.
d) Right blank padded number.
e) None of the above.
A. Right blank padded number.

1) How do you identify duplicates?


2) There are two tables A and B, How can you get all rows in A but not in B?
3) How can you optimize your code in SQL?

1. How can you run a process in the back ground?


A. By appending “&” to the command or by using “bg” or “nohup”.

2. How can you bring the job running in back ground to fore ground?
A. By typing “fg”.

3. What is “awk”? Which one do you prefer “cut” or “awk”?


A. awk options filename:
Scans for patterns in a file and processes the results.
cut options filename:
Cuts specified fields/characters from lines in a file.
awk can extract fields from anywhere in a line and process them further, whereas cut
only extracts the specified fields or characters.


Gather: collects records from many sources; reads data from flow partitions. It
reduces data parallelism and pipeline parallelism, and it does not support default
record assignment.

LocalMerge: reads data from many sorted sources and maintains the sort order.

Concatenate: takes multiple streams of data and appends them one after another; it
maintains the order.

Interleave: collects records from many sources in round-robin fashion. It reads a
block-size number of records from the first partition.

Partition by Load Balance: balances uneven processing capability.

Partition by Key and Sort: all records with the same key are in the same partition;
relevant for local lookup.

Partition by Round-robin: distributes data evenly across the output partitions;
reads records in chunks.

HashPartition: reads records in arbitrary order from the input and distributes them
to flow partitions.

CheckOrder: verifies that data is sorted according to your specification.

Transformer

Aggregate: generates summary records for groups of input records.

Dedup: removes duplicate records (suppresses the duplicates).

Denormalize: groups multiple records into a single output record (Praveen, kumar,
thadakamalla becomes praveenkumarthadakamalla), e.g. multiple people having the same
address.

Normalize: turns one record into multiple records (house to people).

Merge: correlates data from different sources; reads records from multiple input
ports and operates on the records with matching merge keys.

Merge-Join: performs inner, outer, and semi joins, as in a relational database.

Reformat: used to change the record format of your data.

Rollup: generates summary records for groups of input records; requires sorted
input. It reduces each group to a single output record; finalize outputs once per
group (year to date).

Scan: a multistage transform that produces a series of intermediate summary records
for groups of input records; finalize outputs for every record.

Performance: checkpoints, phases, deadlock, memory release, local lookup. Put a
phase break before a join so that a lot of memory gets released.

Layout: the location (URL); specifies how some part of an application is
partitioned. E.g. a component's layout specifies the number and location of its
partitions, giving a hostname and pathname for each partition. Every component of an
application has a layout (even for one partition).

mkfs: multifile system. $mpjrect: 0 for success; used to check whether the graph ran
successfully.

m_mkfs: for creating a multifile system.

m_ls, m_cp, m_dump, and m_rollback are Co>Operating System shell-level utilities (for
managing parallel files, managing metadata, and recovering a checkpointed process).

mp commands: components run with mp commands, the Ab Initio command interpreter, e.g.
mp command-name arg1 arg2

mp job: establishes the framework
mp ifile: defines data file components
mp metadata: defines metadata that describes the data in flows
mp hash-partition: defines program components
mp run: runs the application

Skew and monitoring: at the user's request, the Co>Operating System monitors Ab Initio
jobs and issues periodic reports. Monitoring is controlled in either of two ways:
Shell: set the configuration variable xx_report before running the job.
Within the script: supply arguments to the report option of the mp run command.
The two interfaces accept the same set of keyword arguments; if both interfaces are
used, the effect is additive. In summary, the keywords are: verbose error, expanded
graph, flows, times, skew, skew = n, scroll = mode, file = filename, interval = n,
table flows.

The two most used values are flows and times.

The value flows enables monitoring of all data flows.

The value times enables monitoring of all processes.

The additional keywords control reporting characteristics. The value of xx_report, or
of the mp run report option, is a series of space-separated keywords.

For example: export xx_report="flows times interval=10"

A basic report shows, for each flow: data bytes (bytes transmitted or received on
the flow), records (records transmitted or received on the flow), and unopened/

Skew: the % by which the amounts of data in the flow partitions are skewed.
0% means all partitions have equal amounts of data.
100% means some partition has all the data.

Characteristics of an Ab Initio job report.

Layout: the number and locations of the partitions of a component are described as a
layout.

You specify a layout for two reasons:
to create a multifile system, a place where parallel files are stored;
to construct a layout object used to describe the parallelism of a component program
in a graph application.

A layout is a list of hostname/pathname pairs. Each entry in the list represents one
partition of a multifile or a program component. Layouts are derived from the layouts
of parallel files, program layouts, and remote connections. A dataset or program has
a layout that defines its locations (hostname + pathname); depending on the use, we
go for a single layout or multiple layouts.

Recovery

Abnormal termination: a completed job means the job started and finished
successfully.
Software error: the Co>Operating System will take care of it.
Native problem: don't investigate; go to the native system.
jobname.rec: on the host; contains a set of pointers to the log files on every node.

Log files: start/end; host system; variable/ab-initio/host/unique-id; sequence
character.

Automatic (software): temporary files; kill all the processes.

Investigating: recapsulation.

Restore the earlier system; shut down; performed in the intermediate; get the job
running.

m_rollback (-d, -i, -h): manual recovery

-d deletes the job along with its recovery files and log files
-i displays the state of the job and prompts the user whether the job should be
deleted; jobs at the first point will roll back.
m_rollback my-job.rec

Action of m_rollback for a job with no checkpoints: m_rollback myjob.rec

xx_nice, xx_timeout, xx_interval, ab_connection/_script, ab_home, ab_password,
ab_nodes

When running applications, please note the environment variables are passed downward
only.
For example, with the software installed in /usr/local/abinitio:
export AB_HOME=/usr/local/abinitio
export PATH=$AB_HOME/bin:$PATH

The above settings enable your shell to locate the installed Ab Initio Software.

A parallel file is called a multifile.
Multifiles are stored in parallel directories called multidirectories, which reside in a
multifile system.
URL: protocol://hostname/pathname
mfile: for a multifile
file: for a serial file
A multidirectory: mfile://pluto.us.com/usr/ed/mfs1/dat
A multifile: mfile://pluto.us.com/usr/ed/mfs1/dat/s95/new.dat
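The protocol://hostname/pathname shape can be taken apart with a general-purpose URL parser. The Python sketch below uses the standard library's `urlparse`; the helper name `split_ab_url` and the returned dictionary layout are assumptions for the example, and this is not an Ab Initio utility.

```python
from urllib.parse import urlparse


def split_ab_url(url):
    """Split a protocol://hostname/pathname URL into its three parts."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "pathname": parts.path,
        "is_multifile": parts.scheme == "mfile",  # mfile: multifile, file: serial file
    }


info = split_ab_url("mfile://pluto.us.com/usr/ed/mfs1/dat/s95/new.dat")
```

For the multifile URL above, the protocol is `mfile`, the hostname is `pluto.us.com`, and the pathname is `/usr/ed/mfs1/dat/s95/new.dat`.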

m_ls -l mfs

drwxrwxr-x D owner group 512 May 29 17:39 dir
-rw-rw-r-- M owner group 214 May 29 17:03 5% out.dat

where D and M denote a multidirectory and a multifile, and 5% is the file skew,
computed on the number of bytes in each partition.

The concept of skew refers to an unbalanced load among the partitions of a multifile or
among the partitions of a dataflow.
E.g.: for a particular flow or file, there are k partitions, with some total number
of bytes across all partitions.
Then average = total/k; for example, average = 1000/20 = 50.
The skew for a partition with n bytes is then (n - average)/max, which ranges from
-100% through 0% to 100%.
Note that the sum of all the skews is 0%.
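The formula can be checked numerically with a short Python sketch. It assumes that "max" in the formula means the size of the largest partition, which the notes do not state explicitly; `partition_skews` is a made-up name.

```python
def partition_skews(sizes):
    """Per-partition skew in percent: (n - average) / max, times 100."""
    average = sum(sizes) / len(sizes)
    largest = max(sizes)  # assumption: "max" is the largest partition size
    if largest == 0:
        return [0.0 for _ in sizes]
    return [100.0 * (n - average) / largest for n in sizes]


even = partition_skews([50, 50, 50, 50])
skews = partition_skews([100, 0, 0, 0])
```

With four equal partitions every skew is 0%. With all 100 bytes in one partition the average is 25, so that partition has skew 75% and the others -25% each, and the skews still sum to zero.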

Metadata – describes data formats and computations.

m_dump produces a human-readable report that shows how input data is interpreted by
Ab Initio metadata.

For example, m_dump foo.dml foo.dat prints the data in foo.dat as interpreted by the
metadata in foo.dml.

mp job: takes care of configuring the execution environment, checkpointing,
termination, recovery, monitoring, and debugging.

m_attach: facilitates remote startup on large parallel systems.

m_env: displays the current Ab Initio settings.

mp_checkpoint: inserted between two transform components.

In event of failure the application can restart from the most recent checkpoint instead of
from the beginning

In case of a software error or a user Ctrl-C, the Co>Operating System takes care of
automatic rollback, restoring all files, flows, and processes to their initial state
or to their state at the most recent checkpoint.

When a job does not complete normally, it leaves a file in the working directory on the
host system with the name jobname.rec, located at /var/abinitio/vnode/unique-id

Analysing a database table

Once a prototype configuration file is created for the database, each table must be
analyzed with db_config (analyzes a table to determine ..column names and types, the
applicable nodes, and the best scheme for loading or unloading it)

It generates several files, including the load, unload, and config files:
output_prefix.dml
output_prefix.dbml
output_prefix.cfg
output_prefix.unload
output_prefix.load

Database Components
DB Unload
DB Load
DB Truncate
SQLrun – run miscellaneous SQL against the database

m_db_env: prints the database environment information.

db_layout: shows the layout of the table, e.g. db_layout foo.cfg

unload.dml
record
  string(",") name;
  decimal(",") age;
end;

reformat.dml
record
  string(10) name = "";
  decimal(3) age = "";
end;
