A Practical Introduction to Ab Initio Software: Part 1

24 August 2007

Course Structure

Part 1: Basic Concepts and DML (Finger Exercises) - Day 1

Part 2: Building Applications & Parallelism - Day 2

Part 3: Parallel Topics, Database Connectivity (Optional) - Intermediate Exercises

What Does Ab Initio Mean?


Ab Initio is Latin for "From the Beginning." From the beginning, our software was designed to support a complete range of business applications, from the simple to the most complex. Crucial capabilities like parallelism and checkpointing can't be added after the fact. The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.

Ab Initio's Focus

Moving Data
Move small and large volumes of data in an efficient manner
Deal with the complexity associated with business data

High Performance
Scalable solutions

Better Productivity

Ab Initio Software
Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Real-time applications
Data movement
Data transformation

Parallel Computer Architecture


Computers come in many shapes and sizes:
Single-CPU, multi-CPU
Network of single-CPU nodes
Network of multi-CPU nodes

Multi-CPU machines are often called SMPs (for Symmetric Multi Processors). Specially-built networks of machines are often called MPPs (for Massively Parallel Processors).

A Multi-CPU Computer (SMP)

A Network of Multi-CPU Nodes

A Network of Networks

Ab Initio Provides For:


Distribution - a platform for applications to execute across a collection of processors, within the confines of a single machine or across multiple machines.

Reduced Run-Time Complexity - the ability for applications to run in parallel, from a single point of control, on any combination of computers where the Ab Initio Co>Operating System is installed.

Applications of Ab Initio Software


Processing just about any form and volume of data.

Parallel sort/merge processing.

Data transformation.

Rehosting of corporate data.

Parallel execution of existing applications.

Applications of Ab Initio Software


Front end of a Data Warehouse:
Transformation of disparate sources
Aggregation and other preprocessing
Referential integrity checking
Database loading

Back end of a Data Warehouse:
Extraction for external processing
Aggregation and loading of Data Marts

Ab Initio Product Architecture

User Applications
Development Environments: GDE, Shell, Ab Initio EME
Components: Component Library, User-defined Components, 3rd Party Components
The Ab Initio Co>Operating System
Native Operating System (Unix, Windows, OS/390)

Co>Operating System Services


Parallel and distributed application execution
Control Data Transport

Transactional semantics at the application level. Checkpointing. Monitoring and debugging.

Parallel file management.


Metadata-driven components.

The Graph Model

The Graph Model: Naming the Pieces

Components

Dataset

Datasets

Flows

The Graph Model: Some Details

Ports

Record format metadata

Expression metadata

Components
Components may run on any computer running the Co>Operating System. Different components do different jobs. The particular work a component accomplishes depends upon its parameter settings. Some parameters are data transformations, that is, business rules to be applied to one or more inputs to produce a required output.

Datasets
A dataset is a source or destination of data. It can be a simple file, a database table, a SAS dataset, ... Datasets may reside on any machine running the Co>Operating System, or on other machines if connected by FTP or database middleware.

Data is always described by record format metadata (termed DML).

Dataset: Records and Fields

A dataset is made up of records; a record consists of fields. Analogous database terms are rows and columns.

Records:

0345 John  Smith
0212 Sam   Spade
0322 Elvis Jones
0492 Sue   West
0121 Mary  Forth
0221 Bill  Black

Each column (id, first name, last name) is a field.

Sources of Record Format Metadata


Record formats can be generated from:
Database catalogs

COBOL copybooks
Other third-party products SAS datasets

One can always resort to manual entry!

A Sandbox Environment
Setting up a standard working environment helps a development team work together. The Sandbox capability allows an application to be designed to be trivially portable. The Sandbox contents are a project administrative function.

Sandbox Parameters

1. Start the Ab Initio GDE.
2. Open mp/figure-01.mp.
3. Go to Project > Edit Sandbox...

Environment Quick Overview


$AI_RUN - run directory
$AI_DML - record format files
$AI_XFR - transform files
$AI_MP - graphs
$AI_DB - database config files
$AI_SERIAL - serial source data, other serial data
$AI_MFS - Ab Initio multifile directory; in training it will also contain partition directories (more about this later!)
$AI_LOG - a location to place logging files, etc.

Environment Overview
We will make use of environment variables (shortcuts, parms) during class. The goal is to have a development environment which enables the migration of a graph, or a set of graphs, to any other environment with absolutely no changes.

Viewing Component Properties

Double click on a component to bring up its Properties Page

Viewing Port Properties

Click on the Ports Tab to view the Port(s) Properties

Record Format Metadata in Graphical Form

0345 John  Smith
0212 Sam   Spade
0322 Elvis Jones
0492 Sue   West
0121 Mary  Forth
0221 Bill  Black

Editing Types in GDE


Don't do a Save when exiting.

Field name

Field type

Field length

The Record Format Metadata in text form

record
  decimal(4) id;
  string(6) first_name;
  string(6) last_name;
  string(5) newfield;
end

Field Names
Names consist of letters, digits, and underscores: a-z, A-Z, 0-9, _

Note: no spaces, hyphens, $, #, or % characters.

Case matters! ABC and abc are different! Some words are reserved (record, end, date, ...).

Field Type and Field Length


There are several built-in types available via the drop-down menu. This course uses three types: string, decimal (for all numbers), and date. A date type requires a format specifier that is an exact representation of the date (e.g., MM-DD-YYYY).

A field length is either a number for fixed-length fields, or the delimiter that terminates the field for variable-length fields.
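For instance, a sketch mixing the two styles (the field names here are illustrative, not from the course data):

record
  decimal(4) id;            /* fixed length: exactly 4 characters */
  string(",") city;         /* variable length: ends at the comma */
  date("YYYY.MM.DD") dob;   /* length implied by the format specifier */
  string("\n") comment;     /* variable length: ends at the newline */
end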

What Data Can Be Described?


There are both fixed-size and variable-length types.

ASCII, EBCDIC, UNICODE character sets are supported.


Supported types can represent strings, numbers, binary numbers, packed decimals, dates Complex data formats can consist of nested records, vectors, ...
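As a sketch of a complex format, a record might nest a subrecord and a fixed-length vector (field names are illustrative, not from the course data):

record
  decimal(4) id;
  record
    string(6) first;
    string(6) last;
  end name;               /* nested subrecord */
  decimal(3) scores[5];   /* vector of five decimal fields */
end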

Access to Field Characteristics


Some aspects of field descriptions (e.g., date formats) must be accessed via the attribute pane. To see additional attributes, use the Attributes item on the Record Format Editor's View Menu, or use the Attributes button.

More Record Format Editing

View Attributes

Length can be a delimiter string
Date format goes here

Field Type drop-down

Text Record Format for Date Field

record
  decimal(4) id;
  string(6) first_name;
  string(6) last_name;
  date("YYYY-DD-MM") newfield;
end;

Expressions in DML
Computations are expressed in the algebraic syntax of C, Pascal, etc. Field names act as variables.

Arithmetic operators: +, -, *, ...
Comparison operators: >, <, ==, !=, ...
Many built-in functions: string_concat, string_trim, today, date_day_of_week, ...

(See the Data Manipulation Language Reference for more information on expressions and built-in functions.)
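For instance, some expression sketches over the course's sample fields (assuming first_name, last_name, and amt exist in the record format):

amt * 1.05 + 10
last_name == "Smith" and amt > 100
string_concat(string_trim(first_name), " ", last_name)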

Viewing Data (mp/figure-01.mp)

1. Right click on dataset.

2. Select View Data...

The View Data Panel

Evaluating Expressions from View Data

Type in an expression...

or use the expression editor

Expression Editor

Fields

Functions

Operators

Expression text

Exercise 1: Writing DML


Open mp/ex1.mp. The data file ex1.dat contains these lines:

Smith,John,1992.02.23,2400
Jones,Jane,1993.10.29,320
Warren,Jake,1994.11.02,9045

Use the Record Format Editor (New) to create a description of this data: lastname, firstname, pur_date, and amt. Then use View Data to verify the description is correct. Hint: Newline delimiters are written: \n
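One possible description, assuming comma delimiters and a newline-terminated last field (a sketch to verify with View Data; declaring pur_date as date("YYYY.MM.DD") instead is possible, but then the comma that follows it must be consumed separately):

record
  string(",") lastname;
  string(",") firstname;
  string(",") pur_date;
  decimal("\n") amt;
end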

Simple Components

In these components the record format metadata does not change from input to output

The Filter by Expression Component


For each record on the input port, the select_expr parameter is evaluated. If select_expr evaluates true (non-zero), the input record is written to the out port exactly as it was read. If select_expr evaluates false (zero), the record is written to the deselect port. The out port, which carries the records meeting the select_expr criteria, must be connected downstream; the deselect output may optionally be used.
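For example, a select_expr sketch using fields from the sample data (assuming id and last_name are in the input record format):

id > 200 and last_name != "Smith"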

Filter Data (Selection)

(figure-02)

1. Push Run button.

2. View monitoring information.

3. View output data.

Expression Parameter

Exercise 2: Data Filtering (Selection)


Using example graph figure-02.mp, change the select expression parameter of the Filter by Expression component to select records with id greater than 215.

Run the application and examine the resulting data.

Keys
A key identifies a single field or set of fields (a composite key) used to organize a dataset in some way.

Single field:    {id}
Multiple field:  {last_name; first_name}
Modifiers:       {id descending}

Used for sorting, grouping, partitioning. (See the Data Manipulation Language Reference for more information on keys. Note: keys are also called collators.)

The Sort Component


Reads records from input port, sorts them by key, and writes the result on the output port.

Sorting (mp/figure-03.mp)

Sorting - The Key Specifier Editor

Exercise 3: Sorting
Using example graph figure-03.mp, change the key parameter of the Sort component to sort the data by first_name.

Run the application and examine the resulting data.

More Complex Components


In these components the record format metadata typically changes (goes through a transformation) from input to output

Data Transformation
Input record format:
record decimal(,) id; date(MMDDYY) bday; string(,)first_name; string(;) last_name; end

0345,090263John,Smith; Drop Reformat Reformat id+1000000


Output record format:
record decimal(7) id; string(8) last_name; date(YYYY.MM.DD) bday; end

Reorder

1000345Smith

1963.09.02

The Reformat Component (mp/figure-04.mp)


Reads records from the input port, reformats each according to a transform function (optional in the case of the Reformat component), and writes the result records to the output (out0) port. Additional output ports (out1, ...) can be created by adjusting the count parameter.

Transformation Functions
A transform function specifies the business rules used to create the output record. Each field of the output record must be successfully assigned a value. Partial output records are not allowed! The Transform Editor is used to create a transform function in a graphical manner.

The Transform Function Editor

Text DML: Transform Function Syntax

Transform functions look like:

output-variables :: name ( input-variables ) =
begin
  assignments;
end;

Assignments look like:

output-variable.field :: expression;

(See the Data Manipulation Language Reference for more information on transform functions.)

The Transform Function in Text Format

out :: reformat(in) =
begin
  out.id :: in.id + 1000000;
  out.last_name :: string_concat("Mac", in.last_name);
end;

A Look Inside the Reformat Component

Each input record has fields a, b, and c; each output record has fields x, y, and z. The transform function is:

out :: trans(in) =
begin
  out.x :: in.b - 1;
  out.y :: in.a;
  out.z :: fn(in.c);
end;

1. A record (9, 45, QF) arrives at the input port.
2. The record is read into the component.
3. The transformation function is evaluated.
4. Since every rule within the transform function is successful, a result record is issued.
5. The result record (44, 9, RG) is written to the output port of the component.

Exercise 4: Reformat Data


Using graph figure-04.mp, change the record format metadata of the Simple-Out dataset to add a new field called name of type string(20). Add a business rule to the existing transform function to populate name by concatenating first_name and last_name using string_concat. Run the graph and examine the results. Then modify the transform to trim the trailing spaces from the first name before concatenating with the last name, so you get "John Smith" rather than "John   Smith".

Data Aggregation

Input:

0345 Smith Bristol   56
0212 Spade London     8
0322 Jones Compton   12
0492 West  London    23
0121 Forth Bristol    7
0221 Black New York  42

Output:

Bristol   63
Compton   12
London    31
New York  42

Data Aggregation of Sorted/Grouped Input

Input (sorted by city):

0345 Smith Bristol   56
0121 Forth Bristol    7
0322 Jones Compton   12
0212 Spade London     8
0492 West  London    23
0221 Black New York  42

Output:

Bristol   63
Compton   12
London    31
New York  42

The Rollup Component (mp/figure-05.mp)

By default, Rollup reads grouped (sorted) records from the input port, aggregates them as indicated by key and transform parameters, and writes the resulting aggregate record on the out port.
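A minimal rollup transform matching the city aggregation shown in the preceding figures might read (a sketch, assuming the input record has city and amount fields):

out :: rollup(in) =
begin
  out.city   :: in.city;
  out.amount :: sum(in.amount);
end;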

Built-in Functions for Rollup


The following aggregation functions are predefined and are available only in the Rollup component:

avg, count, first, last, max, min, product, sum

Rollup Wizard

Note the use of an aggregation function in the expression

Exercise 6: Rollup Data


Using example graph figure-05.mp, modify the transform function to count the number of records for each city.

Run the application and examine the results.

Joining Data

Input 0 (visits):           Input 1 (last visits):
0345 Smith Bristol   56     0322 970402  1242.50
0212 Spade London     8     0345 970924   923.75
0322 Jones Compton   12     0121 961211 12392.00
0492 West  London    23     0492 971123   234.12
0121 Forth Bristol    7     0666 950616  2312.10
0221 Black New York  42

Output:
0345 Bristol   56 1997/09/24
0212 London     8 1900/01/01
0322 Compton   12 1997/04/02
0492 London    23 1997/11/23
0121 Bristol    7 1996/12/11
0221 New York  42 1900/01/01

Joining Sorted Data on the id field

Input 0 (sorted by id):     Input 1 (sorted by id):
0121 Forth Bristol    7     0121 961211 12392.00
0212 Spade London     8     0322 970402  1242.50
0221 Black New York  42     0345 970924   923.75
0322 Jones Compton   12     0492 971123   234.12
0345 Smith Bristol   56     0666 950616  2312.10
0492 West  London    23

Output:
0121 Bristol   7 1996/12/11
0212 London    8 1900/01/01
...

Building the Output Record

in0:

record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1:

record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end

out:

record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

What if the in1 record is missing?

in0:

record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1: ??? (no record with a matching key - its fields are unavailable)

out:

record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

Prioritized Assignment

Destination   Priority   Source
out.dt        1          in1.dt
out.dt        2          "1900/01/01"

In DML, a missing value (say, if there is no in1 record) causes an assignment to fail. If an assignment for a left-hand side fails, the next-priority assignment is tried. There must be one successful assignment for each output field.
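Put together, the prioritized rules might read as follows (a sketch using the in0/in1 record formats shown earlier; "1900/01/01" is the default supplied when no in1 record exists):

out :: join(in0, in1) =
begin
  out.id     :: in0.id;
  out.city   :: in0.city;
  out.amount :: in0.amount;
  out.dt :1: in1.dt;
  out.dt :2: "1900/01/01";
end;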

Assigning Priorities to Business Rules

Resulting display when out.dt is selected

The Join Component


Join performs a join of its inputs. By default, the inputs to Join must be sorted, and an inner join is computed. Note: the following slides and the on-line example assume the join-type parameter is set to Outer, and thus compute an outer join.

Driving Key, max-core, Record - Required

Joining (mp/figure-06.mp)

A Look Inside the Join Component*

*join-type = Full Outer join

Each in0 record has fields a, b, and c; each in1 record has fields a, q, and r. The component aligns its inputs by the key field a and passes the aligned records to the transform function:

out :: join(in0, in1) =
begin
  out.a :: in0.a;
  out.x :1: in1.r + 20;
  out.x :2: in0.b + 10;
  out.q :1: in1.q;
  out.q :2: "XX";
end;

1. Records arrive at the inputs of the Join: (G, 234, 42) on in0 and (G, NY, 4) on in1, and are read into the component.
2. The input key fields are compared. The keys match, so the aligned records are passed to the transformation function.
3. The transformation engine evaluates each rule based on the inputs; with both inputs present, the priority-1 rules succeed.
4. As all output fields are successfully computed, the result record (G, 24, NY) is emitted and written to the output port.
5. New records arrive: (H, 79, 23) on in0 and (K, IL, 8) on in1. They are read in and their key fields are compared, but this time the keys do not match, so the in0 record is processed without a partner. The priority-1 rules fail (the in1 fields are missing), the priority-2 rules supply out.x (in0.b + 10 = 89) and the default out.q ("XX"), and the result record (H, 89, XX) is written out.

Exercise 7: Join Data


Using example graph figure-06.mp, modify the transform function to join visits.dat and last-visits.dat so that no records are rejected.

Run the application, and examine the results. The Unmatched Last Visits dataset should be empty.

Exercise 8 (if time): Join Retaining All Fields


Building upon the graph you created in Exercise 7, create a new output record format and transform function to join visits.dat and last-visits.dat according to the following rules:
Retain all fields from each dataset. Supply defaults where necessary.

Change the necessary parameters, run the application, and examine the results.

Lookup Files
DML provides a facility for looking up records in a dataset based on a key:
lookup(file-name, key-expression)

The data is read from a file into memory.

The GDE provides a Lookup File component as a special dataset with no ports.

Using lookup instead of Join

Using Last-Visits as a lookup file

Configuring a Lookup File


1. Label used as name in lookup expression
2. Browse for pathname
3. Set record format
4. Set the lookup key

Using a lookup file in a Transform Function

Input 0 record format:

record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

Output record format:

record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end

Transform function:

out :: lookup_info(in) =
begin
  out.id :: in.id;
  out.city :: in.city;
  out.amount :: in.amount;
  out.dt :1: lookup("Last-Visits", in.id).dt;
  out.dt :2: "1900/01/01";
end;

Exercise 9 (if time): Lookup


Building upon the graph you created in Exercise 8, convert the Join into a lookup-based Reformat. Change the necessary parameters, run the application, and examine the results.

The GDE Debugger


The GDE has a built-in debugger capability. To enable the Debugger, select Debugger > Enable Debugger.

The Debugger Toolbar

Enable Debugger

Remove All Watchers

Add Watcher File

Isolate Components

The GDE Debugger


To add a Watcher File, select a flow and click Add Watcher. To remove Watcher Files, click Remove All Watchers. To isolate a set of components, select the components to be isolated; Watcher Files will automatically be placed into the graph by the Debugger.

Note that if the Watcher files do not exist, the GDE will build them during the first run only, reusing the Watchers on successive runs.

Q&A
Any Questions ?

Capgemini
WORLDWIDE HEADQUARTERS: 6400 Shafer Court, Rosemont, Illinois, USA 60018. Tel. 847.384.6100, Fax 847.384.0500, www.capgemini.com
