
en.SafeWatch Profiling 2.0

ETL Technical Guide

April 2013

Copyrights
Copyright 2013 EastNets Holding Ltd. All rights reserved. All contents, including images and graphics, trade names, and trademarks in this document are copyrighted, registered, or under registration. You must obtain permission to reproduce any information, graphics, or images from this document. You do not need to obtain permission to cite, reference, or briefly quote this material, as long as proper citation of the source of the information is made.

Trademarks
EastNets is a registered trademark of EastNets Holding Ltd., located at Dubai Internet City, Building No. 2, Office G02. Tel: +97143912888 Fax: +97143918652 P.O. Box 500135 Dubai-UAE. All brand and product names are trademarks under registration or registered trademarks of their respective companies. Technical specifications and availability are subject to change without notice.

Disclaimer
Although EastNets has made every effort to make this document accurate, up-to-date, and complete, EastNets offers no warranties, express or implied, related to this document. In no event shall EastNets be liable for any loss of profits, loss of business, loss of use or data, interruption of business, or for indirect, special, incidental, or consequential damages of any kind arising from any error in this document.

Send us comments
EastNets welcomes your comments and suggestions on the quality and usefulness of this document. Your input is an important part of the revision process. If you find any errors or have any other suggestions to improve the document quality and clarity, please indicate the chapter and page number (if available). Please send comments to: enss-documentation@eastnets.com


Table of Contents
1 Overview .......................................................................................... 4
2 How ETL Works ................................................................................. 5
  2.1 Repository Panel ........................................................................... 6
  2.2 ETL Jobs ....................................................................................... 7
  2.3 Referential Integrity Validation ...................................................... 12
3 Creating Jobs ................................................................................... 15

Table of Figures
Figure 1: ETL Tool ................................................................................ 4
Figure 2: Talend Open Studio for Data Integration ................................... 5
Figure 3: Talend Open Studio for Data Integration Layout ......................... 6
Figure 4: Repository Panel ..................................................................... 6
Figure 5: ETL Jobs Components and Steps ............................................... 7
Figure 6: Add tMssql_Input/tOracle_Input component ............................... 8
Figure 7: Component Properties ............................................................. 8
Figure 8: Loading Parameters, Input Component, and Processing Input Parts . 9
Figure 9: Connecting tContextLoad with tMSSQL_Input Component ............. 9
Figure 10: Connecting tMSSQL_Input with tMap Component ..................... 10
Figure 11: Mapping .............................................................................. 10
Figure 12: ETL Job Properties ................................................................ 11
Figure 13: Running Job ........................................................................ 11
Figure 14: Referential Integrity Validation Example ................................. 13
Figure 15: Mapping .............................................................................. 14
Figure 16: Creating Connection ............................................................. 16
Figure 17: Database Settings ................................................................ 16
Figure 18: Adding New Job ................................................................... 17
Figure 19: Searching for Components .................................................... 18
Figure 20: Dragging Components .......................................................... 19
Figure 21: Connecting Components ....................................................... 19
Figure 22: tRunJob Settings .................................................................. 20
Figure 23: tJava Settings ...................................................................... 20
Figure 24: tFileInputDelimited Settings ................................................... 21
Figure 25: tOracleInput/tMssqlInput Settings .......................................... 22
Figure 26: Mapping .............................................................................. 22
Figure 27: tSchemaComplianceCheck Settings ......................................... 23
Figure 28: Edit Schema ......................................................................... 23
Figure 29: tFileOutputDelimited Settings ................................................. 24


1 Overview
ETL is a back-office tool that extracts data from the customer's data files and then transforms it into the format required for loading into the en.SafeWatch Profiling Application Framework, in either a real-time or batch setup.

Figure 1: ETL Tool

The need for the ETL Tool arose when EastNets learned, during implementations of en.SafeWatch Profiling, that the manual process caused many customers to face data challenges that prolonged the project duration and made the project more difficult and expensive for EastNets to run. The manual process is as follows:
1. EastNets communicates the Data Layout document to the customer.
2. The customer creates procedures (pieces of code) to extract the required data from the core business solution into data files.
3. The extracted files are reviewed and validated by the I&S engineer, who imports the data and goes through all of the import, validation, staging, and LTA processes.
4. The extracted data files include errors in the structure, data types, referential constraints, and field constraints, so multiple iterations of the validation process and script amendments are needed to end up with proper data files. In addition, the data import processes show neither a completion percentage nor a clear problem log.

EastNets introduced the ETL tool to solve the above issues. It enables EastNets to reduce the data preparation phase from four months to one month or less, reduce the time to market, enhance the ROI, and increase customer satisfaction. It can be used for the Profiling and Anti-Fraud solutions, allowing EastNets development efforts to be used more efficiently through:
1. Validating the data structure:
   - Number of fields.
   - End of lines (CR/LF).
   - Encoding (UTF-8 and UTF-16).
2. Validating data quality regarding all constraints (see the sketch after this list):
   - Field type.
   - Field length.
   - Date format.
   - Mandatory/optional field.
   - Allowed values (e.g. only Y or N allowed).
3. Referential integrity check.
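As an illustration of the field-level constraints listed in item 2, the following is a minimal Java sketch of such checks. The maximum length, date pattern, and allowed flag values used here are illustrative assumptions, not the actual en.SafeWatch Profiling 2.0 layout.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Arrays;
    import java.util.List;

    // Minimal sketch of the field-level checks described above. The constraints
    // (length 35, yyyyMMdd date pattern, Y/N flag) are illustrative assumptions.
    public class FieldConstraintCheck {

        private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyyMMdd");
        private static final List<String> ALLOWED_FLAGS = Arrays.asList("Y", "N");

        // Mandatory field: must be present and within the maximum length.
        static boolean checkMandatoryText(String value, int maxLength) {
            return value != null && !value.isEmpty() && value.length() <= maxLength;
        }

        // Date field: must parse strictly against the expected pattern.
        static boolean checkDate(String value) {
            DATE_FORMAT.setLenient(false);
            try {
                DATE_FORMAT.parse(value);
                return true;
            } catch (ParseException e) {
                return false;
            }
        }

        // Flag field: only the allowed values are accepted (e.g. Y or N).
        static boolean checkFlag(String value) {
            return ALLOWED_FLAGS.contains(value);
        }

        public static void main(String[] args) {
            System.out.println(checkMandatoryText("ACC-001", 35)); // true
            System.out.println(checkDate("20130431"));             // false, invalid day
            System.out.println(checkFlag("X"));                    // false
        }
    }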


2 How ETL Works


In this section, we go through the steps needed to run the ETL jobs used to load data into en.SafeWatch Profiling 2.0. To build the ETL tool and its jobs, we use Talend Open Studio for Data Integration v5.0.3, an extensive Java-based user interface (and therefore platform independent) that provides a graphical ETL job editor as well as a data mapper, allowing operators or data processing staff to connect to the different sources and perform transformations where necessary.

Note: This guide is subject to change and update.

The following is provided with this guide:
- Talend Open Studio for Data Integration v5.0.3.
- The workspace that contains the ETL jobs.
- A properties file that contains the required parameters for the ETL jobs (DB connection parameters, email parameters, etc.).

The following are the steps needed to run the ETL jobs:
1. Install Talend Open Studio for Data Integration, and then open it.
2. Select the attached workspace for ETL in the Workspace field.
3. Select EASTNETSETL from the Project box.
4. Click Open.

Figure 2: Talend Open Studio for Data Integration

5. From the next screen click Start. The following screen will be displayed:


Figure 3: Talend Open Studio for Data Integration Layout

2.1 Repository Panel


The Repository panel is the left-side panel of the Talend Open Studio for Data Integration layout. Here you can add new jobs, open jobs, add new parameters, add connections to the database, and so on. This panel includes the following:
- Job Designs: The jobs are divided into folders based on their functionality:
  - Account Jobs
  - Balance Jobs
  - Customer Jobs
  - Reference Tables Jobs
  - Transactions Jobs

- Job Context: Includes the parameters defined for each job.
- Routines: Where we create a Java class called FilterCharacters that removes characters not allowed by the en.SafeWatch Profiling 2.0 layout.
- DB Connections: Where we create the database connection to the core banking system.
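The body of the FilterCharacters routine is not reproduced in this guide. The following is a minimal sketch of what such a routine could look like, under the assumption that only printable ASCII characters are allowed; the actual allowed character set is defined by the en.SafeWatch Profiling 2.0 layout.

    // Hypothetical sketch of a Talend routine similar to FilterCharacters.
    // The allowed character range (printable ASCII) is an assumption;
    // the real routine follows the en.SafeWatch Profiling 2.0 layout.
    public class FilterCharacters {

        // Removes every character outside the assumed allowed range.
        public static String filter(String input) {
            if (input == null) {
                return null;
            }
            StringBuilder cleaned = new StringBuilder(input.length());
            for (char c : input.toCharArray()) {
                if (c >= 0x20 && c <= 0x7E) { // keep printable ASCII only (assumption)
                    cleaned.append(c);
                }
            }
            return cleaned.toString();
        }
    }

In a Talend job, a routine like this would typically be called from a tMap expression, for example FilterCharacters.filter(row1.ACCOUNT_NAME), where row1.ACCOUNT_NAME is a hypothetical input column.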

Figure 4: Repository Panel


2.2 ETL Jobs


Every ETL Job in our ETL Tool has the following steps, parts, and components that are to be connected in a specific manner to perform the job:

Figure 5: ETL Jobs Components and Steps

1. Represents loading the needed parameters for the job from the properties file.
2. Represents reading data from the core banking system using the tMssql_Input component (or the tOracle_Input component in the case of Oracle) and passing the resulting data to the tMap component, which checks the incoming data type of each column and maps it from the core banking table to a field in the output file, in the correct order and according to the en.SafeWatch Profiling 2.0 Data Layout.
3. Represents validating the data length of each column, using tSchemaComplianceCheck, so that it is compliant with the en.SafeWatch Profiling 2.0 Data Layout, and moving incorrect or bad records to a bad file.
4. Represents writing the correct records, using the tFileOutputDelimited component, to a pipe-delimited (|) file.
5. Represents building a listener for the job, so that an email is sent if an error in connecting to the core banking database kills the ETL job.

To sum up, the ETL jobs work as follows (a conceptual sketch follows this list):
- Input is read from the core banking system database.
- The ETL processes the input data and prepares it to be compliant with the en.SafeWatch Profiling 2.0 Data Layout.
- The processed data is output to a pipe-delimited (|) file, which then enters en.SafeWatch Profiling 2.0 as input.
- Error or bad records are output to a separate pipe-delimited (|) file.
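The following Java sketch illustrates, conceptually, what a job produces: rows are read from the source database, mapped to the target layout, and split between a pipe-delimited good file and a pipe-delimited bad file. The actual jobs are generated by Talend; the connection URL, query, column names, and file names below are hypothetical.

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Conceptual sketch of what an ETL job produces; the real work is done by
    // generated Talend code. Connection URL, query, columns, and file names
    // are assumptions for illustration only.
    public class AccountJobSketch {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                         "jdbc:sqlserver://corehost:1433;databaseName=CoreBank", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT ACCOUNT_ID, ACCOUNT_NAME, OPEN_DATE FROM tAccount");
                 PrintWriter good = new PrintWriter("std_account.txt");
                 PrintWriter bad = new PrintWriter("std_account_bad.txt")) {

                while (rs.next()) {
                    String id = rs.getString("ACCOUNT_ID");
                    String name = rs.getString("ACCOUNT_NAME");
                    String openDate = rs.getString("OPEN_DATE");
                    String line = String.join("|", id, name, openDate); // pipe-delimited record
                    // A record whose fields violate the target layout goes to the bad file.
                    if (id == null || id.length() > 35) {
                        bad.println(line);
                    } else {
                        good.println(line);
                    }
                }
            }
        }
    }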


Example: In this example, we used the Profiling database as input, because we do not have a core banking system database to represent the input for our ETL jobs. The following are the steps we performed to run the std_account job, which is currently the only completed job and is used for testing and explanation. The same can be done for the other jobs:
1. Add the tMssql_Input or tOracle_Input component:
   a. Drag the input table (here, the tAccount table from the Profiling database) as shown below:

Figure 6: Add tMssql_Input/tOracle_Input component

b. Double click on the tMSSQL_Input component (tAccount table) to open the component properties section, and then enter the required query in the query attribute as shown below:

Figure 7: Component Properties
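The query entered in the Query attribute is an ordinary SQL statement that selects the columns the job needs. The following is a hypothetical example (Talend stores the query as a Java string; the column names and filter are assumptions, not the actual Profiling schema):

    // Hypothetical example of a Query attribute value; Talend stores the query
    // as a Java string. The column names and WHERE clause are assumptions.
    String query = "SELECT ACCOUNT_ID, CUSTOMER_ID, ACCOUNT_TYPE, OPEN_DATE, STATUS "
                 + "FROM tAccount "
                 + "WHERE STATUS <> 'CLOSED'";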


2. Connect the three job parts shown in the figure below, namely the Loading Parameters part, the Input Component, and the Processing Input part, by connecting the tMssql_Input or tOracle_Input component with the tFileInputDelimited component(s) on one side and with the tMap component on the other side:

Figure 8: Loading Parameters, Input Component, and Processing Input Parts

   a. Right click on the tContextLoad component. A menu will be displayed.
   b. Select Trigger > On Component Ok.
   c. Drag the resulting arrow to the tMSSQL_Input component to connect it with the tContextLoad component as shown below:

Figure 9: Connecting tContextLoad with tMSSQL_Input Component

   d. Right click on the tMSSQL_Input component. A menu will be displayed.
   e. Select Row > Main.
   f. Drag the resulting arrow to the tMap component to connect it with the tMSSQL_Input component as shown below:


Figure 10: Connecting tMSSQL_Input with tMap Component

3. Map the input data read from the database to the output file that is compliant with the en.SafeWatch Profiling 2.0 Data Layout.
   a. Double click on the tMap component. The following screen will be displayed:

Figure 11: Mapping

   b. Start mapping each input column read from the database to an output field in the output file.
   c. Make sure that each field in the output file has the same type used in the input file, and that the output file name is the same as that in the en.SafeWatch Profiling 2.0 Data Layout.
   d. Click OK.


4. Take a look at the properties file properties.txt which contains all the parameters for the ETL jobs:

Figure 12: ETL Job Properties

5. Run the job as shown below:

Figure 13: Running Job

The files generated by running the job can be found at the paths specified in the properties.txt file. Each job should generate two files: one for the correct records and the other for the error or bad records.
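The keys defined in properties.txt depend on the installation. As a sketch of how such a file is consumed, the snippet below reads it with java.util.Properties; the key names (db_host, output_dir, bad_file_dir) are hypothetical examples, not the actual parameter names expected by the jobs.

    import java.io.FileInputStream;
    import java.util.Properties;

    // Sketch of reading the job parameters from properties.txt.
    // The key names below are hypothetical; the real file defines the keys
    // that the jobs' contexts expect (DB connection, email, output paths, ...).
    public class LoadJobProperties {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream("properties.txt")) {
                props.load(in);
            }
            String dbHost = props.getProperty("db_host");
            String outputDir = props.getProperty("output_dir");
            String badFileDir = props.getProperty("bad_file_dir");
            System.out.printf("DB host: %s, output: %s, bad records: %s%n",
                    dbHost, outputDir, badFileDir);
        }
    }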


2.3 Referential Integrity Validation


In phase 2 of the ETL, we added the Referential Integrity Validation for the input files of en.SafeWatch Profiling 2.0. The reference table files, which represent lookup tables, are used to validate referential integrity for the jobs that need this kind of validation. The following is a list of these jobs:
- std_account
- std_account_custom_field
- std_account_declaration
- std_corrbanking
- std_country_list_detail
- std_currency_list_detail
- std_customer
- std_customer_account
- std_customer_address
- std_customer_custom_field
- std_customer_customer
- std_customer_decalartion
- std_customer_financial
- std_customer_legal
- std_customer_phyiscal
- std_eodbalance
- std_transaction
- std_transaction_information
- std_transaction_list_detail


Example: In this example, we used the Profiling database as input, because we do not have a core banking system database to represent the input for our ETL jobs. The following are the steps for adding the Referential Integrity Validation to the std_account job, which is currently the only completed job and is used for testing and explanation:

Figure 14: Referential Integrity Validation Example

In the above figure, the stars represent the following:
- Yellow stars: The sub jobs that should be run before the main job in order to generate the lookup files needed for the Referential Integrity Validation.
- Green stars: The lookup files generated by the sub jobs.
- Blue star: The error file, which holds the rows that failed the Referential Integrity Validation.

1. Add the tMssql_Input or tOracle_Input component (as described in the previous example).
2. Connect the tMssql_Input or tOracle_Input component to the tMap component (as described in the previous example).
3. Connect the tFileInputDelimited component(s) (green stars) to the tMap component (as described in the previous example).
4. Double click on the tMap component. The following screen will be displayed, where the numbers in the red squares represent the following:
   (1) The MSSQL or Oracle input component that reads data from the core banking system database table.
   (2) The lookup files generated by the sub jobs. These are used to check the referential integrity.
   (3) The output that will be written to a file. This file will later be the input data file for en.SafeWatch Profiling 2.0.
   (4) The output that will be written to a file. This file represents the error records that failed the Referential Integrity Validation.
   (5) The tSettingsMap, which is used to manage the joins between the files.


Figure 15: Mapping

   a. Map each field in (1) to a field in (3).
   b. Map each field in (1) to a field in (4), then click on tSettingsMap (5) and select True for Catch lookup inner join reject.
   c. Map each key field in (1) to its counterpart field in (2), then click on tSettingsMap (5), click on Join Model, and select Inner Join.
   d. Click OK.
5. Run the job (as described in the previous example).
Note: You can use the tLogRow component to log the generated results to the console or to a file.
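The effect of the Inner Join and Catch lookup inner join reject settings configured in the tMap above can be pictured with the following Java sketch: the lookup file is loaded into a set of valid keys, and every main-flow row whose key is not found is caught as a reject and written to the error file. The file names and the position of the key field are assumptions for illustration only.

    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Conceptual sketch of the "Inner Join" + "Catch lookup inner join reject"
    // behaviour in the tMap. File names and the key field index are assumptions.
    public class ReferentialIntegritySketch {
        public static void main(String[] args) throws Exception {
            // 1. Load the lookup file produced by a sub job into a set of valid
            //    keys; the key is assumed to be the first pipe-delimited field.
            Set<String> validKeys = new HashSet<>();
            for (String line : Files.readAllLines(Paths.get("std_currency_list_detail.txt"),
                                                  StandardCharsets.UTF_8)) {
                validKeys.add(line.split("\\|")[0]);
            }

            // 2. Pass main-flow rows that join; reject the ones that do not.
            List<String> mainRows = Files.readAllLines(Paths.get("std_account_raw.txt"),
                                                       StandardCharsets.UTF_8);
            try (PrintWriter good = new PrintWriter("std_account.txt");
                 PrintWriter rejects = new PrintWriter("std_account_ri_errors.txt")) {
                for (String row : mainRows) {
                    String[] fields = row.split("\\|", -1);
                    // The foreign key is assumed to be the third field of the row.
                    if (fields.length > 2 && validKeys.contains(fields[2])) {
                        good.println(row);       // inner join succeeded
                    } else {
                        rejects.println(row);    // caught as an inner-join reject
                    }
                }
            }
        }
    }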


3 Creating Jobs
This section describes how to create a job from scratch, using the std_account job as an example, since it is currently the only completed job used for testing and explanation:
1. Define the connection to the database. This should be done once and will be reused by other jobs:
   a. From the Repository panel, right click on Db Connections, and a context menu will be displayed:


b. From the context menu, click on Create Connection, and the following window will be displayed:

Figure 16: Creating Connection

c. Enter the required data (the connection Name field is mandatory) then click on Next, and the following window will be displayed:

Figure 17: Database Settings


   d. Enter the needed database connection information in the specified fields.
   e. Click on the Check button to check the database settings.
   f. Click on Finish.
2. Create the empty job:
   a. From the Repository panel, right click on Job Designs, and a context menu will be displayed:

b. From the context menu, click on Create Job, and the following window will be displayed:

Figure 18: Adding New Job

c. Enter the required data (the Job Name field is mandatory) then click on Finish.


3. Search for the desired components needed to build the job by using the Find Component field of the Palette panel:

Figure 19: Searching for Components

The components needed to build jobs that follow the profiling jobs workflow are as follows:
- tRunJob (optional): Used to run sub-jobs before running the main one.
- tJava: Used to load the EASTNETS_CONFIG_HOME environment variable path.
- tContextLoad: Used to load the needed configuration parameters.
- tFileInputDelimited: Used to read the properties file that contains the needed configurations, and to read the results of the sub-jobs.
- tMssqlInput/tOracleInput: Used to hold the SQL statement that retrieves data from the core banking system.
- tMap (required): Used to map the core banking system data to the output files needed for profiling, and to check referential integrity.
- tFileOutputDelimited: Used to write the files generated by the job, including the error and bad files.
- tSchemaComplianceCheck: Used to check whether the core banking system data is compliant with the field lengths of the en.SafeWatch Profiling Data Layout.


4. Start building the job: a. Drag the needed components to the job panel to look like the following figure:

Figure 20: Dragging Components

b. Connect the components together as described in the following figure:

Figure 21: Connecting Components


5. Set the needed configurations for each component as follows (the same is to be done for any job):
- tRunJob:
   a. Select tRunJob. The settings section for this component will be displayed in the bottom panel.
   b. In the Job field, specify the needed sub-job from the workspace.
   c. Check the Die on Child Error checkbox.

Figure 22: tRunJob Settings

- tJava:
   a. Select the tJava component. The settings section for this component will be displayed in the bottom panel.
   b. In the Code text box, enter the lines shown in the figure below:

Figure 23: tJava Settings
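The exact lines are those shown in the figure. As a hedged sketch of what such code could look like, the fragment below resolves the EASTNETS_CONFIG_HOME environment variable and stores the path of the properties file in a context variable; the variable name configFile is a hypothetical example (inside a tJava component, context refers to the job's context variables, so this fragment lives inside the generated job rather than in a standalone class).

    // Hypothetical sketch of the tJava code; the actual lines are those shown
    // in Figure 23. "configFile" is an assumed context variable name.
    String configHome = System.getenv("EASTNETS_CONFIG_HOME");
    if (configHome == null || configHome.isEmpty()) {
        throw new RuntimeException("EASTNETS_CONFIG_HOME environment variable is not set");
    }
    context.configFile = configHome + "/properties.txt";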


- tFileInputDelimited:
   a. Select the tFileInputDelimited component. The settings section for this component will be displayed in the bottom panel.
   b. Specify the file path in the File name/Stream field.
   c. Specify the Row Separator.
   d. Specify the Field Separator. In this example we use a comma (,) because we are dealing with the configuration properties file; for the other input files we use the pipe character (|) instead.
   e. Edit the selected schema.
   f. Repeat the above steps for every tFileInputDelimited component used in the job.

Figure 24: tFileInputDelimited Settings

- tOracleInput/tMssqlInput:
   a. Select the tOracleInput/tMssqlInput component. The settings section for this component will be displayed in the bottom panel.
   b. To use a predefined connection at the workspace level, check the Use an existing connection checkbox. In this case, you can still edit the connection properties if desired.
   c. In the case of a new connection, enter the needed connection information in the specified fields.
   d. Enter your query in the Query textbox.


Figure 25: tOracleInput/tMssqlInput Settings

- tMap:
   a. Double click on the tMap component. The following screen will be displayed, where the numbers in the red squares represent the following:
      (1) The MSSQL or Oracle input component that reads data from the core banking system database table.
      (2) The lookup files generated by the sub jobs. These are used to check the referential integrity.
      (3) The output that will be written to a file. This file will later be the input data file for en.SafeWatch Profiling 2.0.
      (4) The output that will be written to a file. This file represents the error records that failed the Referential Integrity Validation.
      (5) The tSettingsMap, which is used to manage the joins between the files.

Figure 26: Mapping


   b. Map each field in (1) to a field in (3).
   c. Map each field in (1) to a field in (4), then click on tSettingsMap (5) and select True for Catch lookup inner join reject.
   d. Map each key field in (1) to its counterpart field in (2), then click on tSettingsMap (5), click on Join Model, and select Inner Join.
   e. Click OK.
- tSchemaComplianceCheck:
   a. Select the tSchemaComplianceCheck component. The settings section for this component will be displayed in the bottom panel.

Figure 27: tSchemaComplianceCheck Settings

   b. Click on Sync Columns.
   c. Check the Check all columns from schema checkbox.
   d. Click on Edit Schema, and the following popup will be displayed:

Figure 28: Edit Schema


   e. Adjust the maximum field length.
   f. Click on OK.
- tFileOutputDelimited:
   a. Select the tFileOutputDelimited component. The settings section for this component will be displayed in the bottom panel.

Figure 29: tFileOutputDelimited Settings

   b. Enter the file path in the File Name field.
   c. Specify the Row Separator.
   d. Specify the Field Separator.
   e. Click on Sync Columns to sync the columns.
   f. Click on Edit Schema to edit the selected schema.

6. Now you can run and test the job.

Note: For more information on Talend Open Studio, refer to the following documents:
- TalendOpenStudio_Components_RG_51b_EN (Components Reference)
- TalendOpenStudio_DI_UG_51b_EN (User Guide)
