DataStage Job Design Approach

Agenda
- Introduction
- Framework
- Scheduling Approach
- Restartability
- Reusability/Templates
- Modularity and Maintainability
- Performance Considerations
2002. Infosys Technologies Ltd.
Introduction
Job design is influenced by the following points:
- Framework
- Scheduling Approach
- Restartability
- Reusability/Templates
- Modularity and Maintainability
- Performance Considerations
- Metadata Management
Framework
- Reprocessing
- System Health Tables
- ACR Balancing
- Logs, Errors & Warnings
Framework: Reprocessing
- Records are errored out according to the defined business rules and should be reconsidered the next time the job runs.
- Reprocessing is required/enforced if the quality of the data is not good enough.
- Reprocessing influences job design and the framework in several ways:
  - Error records need to be retained to allow corrections, which creates the need for a landing/work table.
  - The job should have logic to handle duplicate records with the same natural key.
  - The ACR log file should accommodate the count of reprocessed records.
  - End users should be able to identify error records and correct them.
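The reprocessing points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the `is_valid` business rule, and the count names are hypothetical and do not come from the project framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    natural_key: str
    amount: Optional[float]  # None models a value that fails validation


def is_valid(record: Record) -> bool:
    # Hypothetical business rule: the amount must be present and non-negative.
    return record.amount is not None and record.amount >= 0


def run_load(new_records, retained_errors):
    """Run one load: reconsider retained error records along with new input."""
    # Keep one record per natural key. New input is applied last, so it
    # replaces a retained error record that carries the same key.
    latest = {}
    for record in list(retained_errors) + list(new_records):
        latest[record.natural_key] = record

    loaded = [r for r in latest.values() if is_valid(r)]
    errors = [r for r in latest.values() if not is_valid(r)]

    # Counts an ACR log could carry, including the reprocessed count.
    acr_counts = {
        "new": len(new_records),
        "reprocessed": len(retained_errors),
        "loaded": len(loaded),
        "errored": len(errors),
    }
    return loaded, errors, acr_counts


# First run: "B" fails the rule and is retained in the work table.
loaded1, errors1, acr1 = run_load([Record("A", 10.0), Record("B", None)], [])

# A user corrects the retained record; the next run reconsiders it.
corrected = [Record(r.natural_key, 5.0) for r in errors1]
loaded2, errors2, acr2 = run_load([Record("C", 7.0)], corrected)
```

In the second run the corrected record is loaded together with the new input, and the reprocessed count is reported separately from the count of new records.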
Framework: System Health Tables
- Jobs should provide the information necessary to maintain, track, and control data loading.
- System Health Tables hold the start and end time of a job, the number of records read, written, and bypassed, and the start and end of the batch.
- System Health Tables directly or indirectly influence job design and the framework:
  - Jobs need to generate the necessary files with the necessary information.
  - Jobs need to capture enough information, such as link counts.
  - Reusable and common jobs will be identified.
  - Scheduling and sequencing will be influenced.
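As a rough illustration of the information listed above, the sketch below models one job run as a Python record. The field names and the balancing rule (records read = records written + records bypassed) are assumptions made for the example, not the actual table layout.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobRunStats:
    job_name: str
    batch_id: str
    start_time: datetime
    end_time: datetime
    records_read: int
    records_written: int
    records_bypassed: int

    def is_balanced(self) -> bool:
        # Assumed rule: every record read is either written or bypassed.
        return self.records_read == self.records_written + self.records_bypassed

    def to_row(self) -> dict:
        """Flatten the run into a dict shaped like one health-table row."""
        return {
            "job_name": self.job_name,
            "batch_id": self.batch_id,
            "start_time": self.start_time.isoformat(),
            "end_time": self.end_time.isoformat(),
            "duration_seconds": (self.end_time - self.start_time).total_seconds(),
            "records_read": self.records_read,
            "records_written": self.records_written,
            "records_bypassed": self.records_bypassed,
            "balanced": self.is_balanced(),
        }


# Hypothetical run: 1000 records read, 990 written, 10 bypassed.
stats = JobRunStats(
    job_name="ADWGR1005T",
    batch_id="BATCH-001",
    start_time=datetime(2005, 2, 10, 2, 0, 0),
    end_time=datetime(2005, 2, 10, 2, 1, 30),
    records_read=1000,
    records_written=990,
    records_bypassed=10,
)
row = stats.to_row()
```

A job would need to expose its link counts for the read, written, and bypassed values to be filled in, which is why this requirement feeds back into job design.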
Framework: Common Tables
A few common tables from the CSL/ABI projects:
- DTMT_PRCS: Stores information about business processes.
- DTMT_PGM_CNTL: Stores all control table entries.
- DTMT_PGM_ERR: Stores information about errors that occurred during program execution.
- DTMT_PGM_EXEC_H: Stores the execution history of every program execution.
- DTMT_REC_ERR_LOG (staging table): Staging table for error records to be corrected.
- DTMT_SRC: Contains source file names.
- DTMT_PGM: Contains details about all the programs.
Framework: Logs, Errors & Warnings
- DataStage jobs should have provisions to maintain logs, errors, and warnings.
- Logs are required to facilitate debugging and to keep track of job execution.
- Errors and warnings need to be logged to validate business rules and data validations.
- Restartability plays a vital role in loading errors and warnings.
- Reusable/common jobs can be identified.
Scheduling
The scheduling approach will affect the job designs. Scheduling can be done in two ways:

1. Use DataStage Sequencers for sequencing the jobs and use Control-M only for scheduling. Sequences should be built with restart points.
   - Pros: Sequencing complexity is abstracted inside the Sequencers.
   - Pros: Scheduling is simplified, because only the starting point needs to be scheduled.
   - Cons: Complexity and additional effort in building Sequencers. Sequencing and job designs are tightly coupled.
2. Use Control-M for both sequencing and scheduling. Break the required functionality into restartable jobs and use Control-M to sequence and schedule them.
   - Pros: Job design is simplified, and sequencing and job designs are loosely coupled.
   - Pros: Flexibility to break up or join jobs without a major effect on sequencing. No additional overhead of maintaining restartable points.
   - Cons: The complexity of sequencing is shifted to scheduling.
Scheduling Sequencer Approach
Scheduling Control-M Approach

- The scheduling of jobs/scripts in a project is done through Control-M. The dependencies between jobs within the same module or across modules (successor/predecessor) are tracked in an Excel sheet (xls), which is submitted to the Control-M team.
- The dependencies are set up in Control-M using triggers, so that a job starts execution only after all its predecessors have completed successfully.
- A trigger can be the successful completion of a job, the presence of a particular file, etc.
Sample Control-M Excel sheet (attached)

Requester Name: Brian Turbes
Contact Information: 612-304-0476, brian.turbes@target.com
Application: ADW Gift Registry
Description of Request: New job setup for application ADWGR
Requested Migration Date: Test 2/10/2005, Prod 04/01/2005

Sheet columns: Time Window for Job Start (optional); Dependencies (job names or line number); Table Name (if table exists); Job Name (if job exists); Action Requested (Add, Change, Delete); Server / Account (Test, Prod); Path Name, Script Name, Parameters; Days Scheduled (M,T,W,Th,F,Sa,Su); Holidays

Server / Account (Test): grmetltest / adwgradm

Table Name      Job Name    Action  Start  Dependency
START_OF_CYCLE  ADWGR0010T  Add     2am
START_OF_CYCLE  ADWGR0020T  Add            ADWGR0010T
START_OF_CYCLE  ADWGR0080T  Add            ADWGR0020T
LANDING_JOBS    ADWGR1005T  Add            ADWGR0080T
LANDING_JOBS    ADWGR1005B  Add            ADWGR1005T
LANDING_JOBS    ADWGR1005L  Change         ADWGR1005B
LANDING_JOBS    ADWGR1008T  Add            ADWGR0080T
LANDING_JOBS    ADWGR1008B  Add            ADWGR1008T
LANDING_JOBS    ADWGR1008L  Change         ADWGR1008B
LANDING_JOBS    ADWGR1010T  Add            ADWGR1008L
LANDING_JOBS    ADWGR1010B  Add            ADWGR1010T
LANDING_JOBS    ADWGR1010L  Change         ADWGR1010B
LANDING_JOBS    ADWGR1015T  Add            ADWGR0080T
LANDING_JOBS    ADWGR1015B  Add            ADWGR1015T
LANDING_JOBS    ADWGR1015L  Change         ADWGR1015B
LANDING_JOBS    ADWGR1020T  Add            ADWGR0080T

Path Name, Script Name, Parameters (per job):
ADWGR0010T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0010T_parms.dat ADWGR ADWGR0010TtableEtlPrcsGrp ADWGR0010T adwgrcur
ADWGR0020T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0020T_parms.dat ADWGR ADWGR0020TtableEtlPrcs ADWGR0020T adwgrcur
ADWGR0080T: /opt/scripts/test/adwetlrun.ksh -f ADWGR0080T_parms.dat ADWGR ADWGR0080TtablePrcsCntl ADWGR0080T adwgrcur
ADWGR1005T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1005T_parms.dat ADWGR ADWGR1005TtableGftrgE ADWGR1005T
ADWGR1005B: /opt/scripts/test/adwacrrun.ksh ADWGR1005B ADWGR1005B ADW3407 adwgrcur ADWGR
ADWGR1005L: F /opt/scripts/test/adwetlrun.ksh -f ADWGR1005L_parms.dat ADWGR ADWGR0030TtableEtlSubPrcs.ADWGR1005 ADWGR1005L adwgrcur
ADWGR1008T: F /opt/scripts/test/adwetlrun.ksh -f ADWGR1008T_parms.dat ADWGR ADWGR1008Tdss1008GftrgCust ADWGR1008T
ADWGR1008B: F /opt/scripts/test/adwacrrun.ksh ADWGR1008B ADWGR1008B ADW3401 adwgrcur ADWGR
ADWGR1008L: F /opt/scripts/test/adwetlrun.ksh -f ADWGR1008L_parms.dat ADWGR ADWGR0030TtableEtlSubPrcs.ADWGR1008 ADWGR1008L adwgrcur
ADWGR1010T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1010T_parms.dat ADWGR ADWGR1010TtableGftrgCustE ADWGR1010T adwgrcur
ADWGR1010B: /opt/scripts/test/adwacrrun.ksh ADWGR1010B ADWGR1010B ADW3409 adwgrcur ADWGR
ADWGR1010L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1010L_parms.dat ADWGR ADWGR0030TtableEtlSubPrcs.ADWGR1010 ADWGR1010L adwgrcur
ADWGR1015T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1015T_parms.dat ADWGR ADWGR1015TtableGftrgBabyE ADWGR1015T adwgrcur
ADWGR1015B: /opt/scripts/test/adwacrrun.ksh ADWGR1015B ADWGR1015B ADW3402 adwgrcur ADWGR
ADWGR1015L: /opt/scripts/test/adwetlrun.ksh -f ADWGR1015L_parms.dat ADWGR ADWGR0030TtableEtlSubPrcs.ADWGR1015 ADWGR1015L adwgrcur
ADWGR1020T: /opt/scripts/test/adwetlrun.ksh -f ADWGR1020T_parms.dat ADWGR ADWGR1020TtableGftrgCharE ADWGR1020T
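
The trigger rule described for the Control-M approach (a job starts only after all its predecessors have completed successfully) can be sketched with Python's standard `graphlib` module. This does not show how Control-M itself is configured; the `run_in_dependency_order` helper and the small dependency map are illustrative assumptions that reuse job names from the sample sheet.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(predecessors, run_job):
    """Run each job once all of its predecessors have succeeded.

    predecessors maps a job name to the set of jobs it waits for.
    run_job(name) returns True on success and False on failure.
    """
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()
    succeeded, failed = [], []
    ready = list(sorter.get_ready())
    while ready:
        for job in sorted(ready):
            if run_job(job):
                succeeded.append(job)
                sorter.done(job)  # releases the successors of this job
            else:
                failed.append(job)  # successors stay blocked
        ready = list(sorter.get_ready())
    return succeeded, failed


# Illustrative dependency map: job -> predecessors.
predecessors = {
    "ADWGR0020T": {"ADWGR0010T"},
    "ADWGR0080T": {"ADWGR0020T"},
    "ADWGR1005T": {"ADWGR0080T"},
    "ADWGR1015T": {"ADWGR0080T"},
    "ADWGR1005B": {"ADWGR1005T"},
}

# Simulate a failure in ADWGR1005T; its successor ADWGR1005B never starts.
succeeded, failed = run_in_dependency_order(
    predecessors, run_job=lambda name: name != "ADWGR1005T"
)
```

Because a failed job is never marked as done, only its own successors are held back; independent branches such as ADWGR1015T still run.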
Restartability
- Restartability influences job design by determining how jobs are broken up.
- Restartability is very important in ETL jobs, and each job should be restartable.
- Restartability plays a vital role in:
  - Loading tables with history
  - Sequence number generation
  - Reprocessing
  - Loading errors/warnings tables
  - Loading System Health Tables
- If Sequencers are used for sequencing, Sequencer routines and shell scripts act as placeholders to maintain restartable points.
- If Control-M is used for sequencing, breaking up jobs and identifying common jobs is key.
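The idea of restartable points can be illustrated with a small checkpoint sketch in Python. It is not DataStage Sequencer functionality: the `run_with_restart` helper, the JSON checkpoint file, and the step names are assumptions chosen to show the pattern of skipping steps that already completed.

```python
import json
from pathlib import Path


def run_with_restart(steps, checkpoint_path):
    """Run (name, action) steps in order, skipping steps already completed.

    Completed step names are saved after every step, so a rerun after a
    failure resumes at the step that failed.
    """
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    executed = []
    for name, action in steps:
        if name in done:
            continue  # restart point: this step finished in an earlier run
        action()  # an exception here leaves the checkpoint at the last good step
        done.add(name)
        path.write_text(json.dumps(sorted(done)))
        executed.append(name)
    # Clean finish: remove the checkpoint so the next cycle starts from step one.
    path.unlink(missing_ok=True)
    return executed
```

If the second of three steps raises an exception, the checkpoint records only the first step, and calling `run_with_restart` again with the same checkpoint path executes the second and third steps only.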
Reusability
- Reusability is very important in software projects. DataStage allows reusability in the following forms:
  - Shared Containers
  - Build Ops
  - Common Jobs
  - Routines
  - Templates
- Shared Containers are the best form of reusability in DataStage. Typical candidates for a Shared Container are:
  - Sequence ID generation logic
  - Errors/warnings generation and loading
  - Loading landing tables with common functionality
  - Common business rules and logic
- A container is a group of stages and links that performs a particular task. The container wraps the complex logic into one unit and acts as a stage.
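As a loose analogy for a shared container, the sketch below wraps sequence ID generation in one Python function that two different loads reuse. The function name, the `last_used_id` parameter, and the field names are hypothetical; a real Shared Container is built from stages and links rather than written as code.

```python
def assign_sequence_ids(records, last_used_id, id_field="seq_id"):
    """Return copies of the records with consecutive IDs and the new last-used ID."""
    numbered = []
    next_id = last_used_id
    for record in records:
        next_id += 1
        numbered.append({**record, id_field: next_id})
    return numbered, next_id


# The same unit is reused by two loads with different parameters.
customers, last_customer_id = assign_sequence_ids(
    [{"name": "Ann"}, {"name": "Bob"}], last_used_id=100, id_field="cust_seq_id"
)
orders, last_order_id = assign_sequence_ids([{"qty": 3}], last_used_id=0)
```

Returning the new last-used ID lets the caller persist it, so that a later run continues the sequence instead of starting again.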
Reusability: Build Ops
- Build Ops provide the flexibility to write your own logic.
- Build Ops can be used to obtain common functionality within or across modules when the logic to achieve that functionality with DataStage stages is complex.
- Code ease: Handling complex conditions, such as many nested if-else statements, or handling many stage variables and their computation is much easier in a BuildOp than in a Transformer stage.
- Coding liberties: A BuildOp allows the use of data structures such as arrays and strings, loop statements such as for and while loops, and many other normal coding paradigms. It also allows the use of various header files and their built-in functions. For example, including string.h provides APT_String, which can be used for string declarations and other string operations.
- The coding features mentioned above are otherwise not easy to use in DataStage.
Reusability: Common Jobs, Routines, and Templates
- A common job performs common tasks across the project/modules, taking different parameters for different contexts.
- Common jobs should be enabled for multiple instances so that several instances can run in parallel.
- Routines help in performing pre-job initiation and post-job initiation activities such as copying input files to different directories, ACR file generation, log files, etc.
- Clarity in defining the activities assigned to shell scripts, DataStage jobs, routines, sequences, and the generic shell script is key to having clean separation and consistency across the project. This will influence the job designs.
- The job template should contain generic annotations that act as a guideline while creating jobs.
- All parameters that are common across all jobs should be defined in the job templates.
- Specific stage properties that are common or mandatory should be defined in the job templates.
- Templates act as a design pattern/guideline for achieving consistency and strict enforcement of dos and don'ts.
- Identifying common patterns and defining templates will achieve consistency.
- A few reusable components will evolve as the project progresses, but enough exercise should be done to bring out reusable components. Piloting a module is another option for bringing out reusable components.
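The common-job idea above (one job definition, different parameters per context, several instances in parallel) can be sketched in Python as follows. The `common_landing_job` function, its parameters, and the table names are hypothetical stand-ins; they do not describe how DataStage runs multiple-instance jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def common_landing_job(invocation_id, source_rows, target_table):
    """Hypothetical common job: the same logic runs for every context."""
    non_empty = [row for row in source_rows if row.strip()]
    return {
        "invocation_id": invocation_id,
        "target_table": target_table,
        "records_written": len(non_empty),
    }


def run_instances(parameter_sets, max_parallel=4):
    """Run one instance of the common job per parameter set, in parallel."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(common_landing_job, **p) for p in parameter_sets]
        # Results come back in submission order, one per instance.
        return [future.result() for future in futures]


results = run_instances([
    {"invocation_id": "cust", "source_rows": ["a", "b", ""], "target_table": "LND_CUST"},
    {"invocation_id": "baby", "source_rows": ["x"], "target_table": "LND_BABY"},
])
```

Each instance is distinguished only by its invocation ID and parameters, which is what keeps the job itself free of context-specific logic.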
Modularity and Maintainability

- Modularity and maintainability are another influencing factor in job design.
- Reusable components and restartability bring the required modularity and maintainability.
- A proper balance needs to be achieved between modularity and I/O operations in a job, keeping restartability in consideration.
- Performance considerations and maintainability should be properly balanced. For example, reducing the number of Transformers in a job will enhance performance, but this should not come at the cost of maintainability.
Performance Considerations
- Identifying the correct stage for the required functionality is key in job design.
- The sequencing of stages in a job should be decided with performance in mind. For example, avoid repartitioning.
- Using temporary tables, work tables, or datasets may enhance performance by reducing the load on jobs, which will influence job design.
- Make sure all the necessary environment variables that can influence performance are part of the template.
- Consider the volume of data when deciding on a stage.
- Detailed points that can influence job performance are covered in performance tuning.
Metadata Management
- Job design will be influenced by metadata management considerations.
- Jobs should not be driven by reject links. To avoid reject links, Lookups should have a dummy column selected from the reference link, and that column should be checked in subsequent stages such as a Transformer.
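The dummy-column technique can be illustrated outside DataStage with a short Python sketch. The lookup keeps every input row, fills the reference columns (including a constant marker) only when a match exists, and a later step checks the marker. The function names, the `ref_found` marker, and the sample data are assumptions for the example.

```python
def lookup_with_marker(rows, reference_rows, key, marker="ref_found"):
    """Join rows to the reference data without rejecting unmatched rows."""
    # Index the reference link by key and add a constant dummy column.
    reference = {ref[key]: {**ref, marker: "Y"} for ref in reference_rows}
    ref_columns = {marker} | {
        col for ref in reference.values() for col in ref if col != key
    }
    for row in rows:
        match = reference.get(row[key], {})
        # Unmatched rows continue with None in every reference column.
        yield {**row, **{col: match.get(col) for col in ref_columns}}


def route(rows, marker="ref_found"):
    """Downstream check, comparable to a Transformer constraint on the marker."""
    matched, unmatched = [], []
    for row in rows:
        (matched if row[marker] is not None else unmatched).append(row)
    return matched, unmatched


orders = [{"cust_id": 1, "qty": 2}, {"cust_id": 9, "qty": 1}]
customers = [{"cust_id": 1, "cust_name": "Ann"}]
matched, unmatched = route(lookup_with_marker(orders, customers, key="cust_id"))
```

The unmatched order for `cust_id` 9 reaches the routing step with an empty marker, so it can be sent to an error or reprocessing path without a reject link.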