You are on page 1of 42

Sreenivas K

04/17/14 1
A Pentaho Data Integration tool
Introduction
ETL Process
Pentahos Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
JDBC
Spoon
Transformations
Jobs



04/17/14 2 MaxQDPro: Kettle- ETL Tool
4 major components:
Extracting
Gathering raw data from source systems and storing it in ETL
staging environment
Data profiling
Identifying data that changed since last load
Transforming- Cleaning and Conforming
Processing data to improve its quality, format it, merge from
multiple sources, enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts


04/17/14 MaxQDPro: Kettle- ETL Tool 3
Data filtering
Is not null, greater than, less than, includes
Field manipulation
Trimming, padding, upper and lowercase conversion
Data calculations
+ - X / , average, absolute value, arctangent, natural logarithm
Date manipulation
First day of month, Last day of month, add months, week of year, day
of year
Data type conversion
String to number, number to string, date to number
Merging fields & splitting fields
Looking up date
Look up in a database, in a text file, an excel sheet,



04/17/14 4 MaxQDPro: Kettle- ETL Tool
Loading
Loading data into data warehouse tables
Managing hierarchies in dimensions
Managing special dimensions such as date and time,
junk, mini, shrunken, small static, and user-
maintained dimensions
Fact table loading
Building and maintaining bridge dimension tables
Handling late arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environment for
specific purposes



04/17/14 MaxQDPro: Kettle- ETL Tool 5
04/17/14 MaxQDPro: Kettle- ETL Tool 6
Complexity and significant operational
problems.
Exceeds the designers expectations
Data Profiling of a source.
Data warehouses typically grow
asynchronously.
Establishing the scalability of an ETL system
across the lifetime .

04/17/14 MaxQDPro: Kettle- ETL Tool 7
Many off-the-shelf tools exist
High-end tools may not justify value for
smaller warehouses
Proprietary ETL
High upfront cost
Long term maintenance
Custom Code
Low upfront cost
Support grows as business requirements changes
04/17/14 8 MaxQDPro: Kettle- ETL Tool
04/17/14 MaxQDPro: Kettle- ETL Tool 9
Tool Vendor
Oracle Warehouse Builder (OWB) Oracle
Data Integrator (BODI) Business Objects
IBM Information Server (Ascential) IBM
SAS Data Integration Studio SAS Institute
PowerCenter Informatica
Oracle Data Integrator (Sunopsis) Oracle
Data Migrator Information Builders
Integration Services Microsoft
Talend Open Studio Talend
DataFlow Group 1 Software (Sagent)
Data Integrator Pervasive
Transformation Server DataMirror
Transformation Manager ETL Solutions Ltd.
Data Manager Cognos
DT/Studio Embarcadero Technologies
ETL4ALL IKAN
DB2 Warehouse Edition IBM
Jitterbit Jitterbit
Pentaho Data Integration Pentaho
Kettle Kettle Extraction Transformation
Transportation & Loading tool
Its open source business intelligence suite for
powerful data integration by Pentaho.
Founded in 2004.
Products of Pentaho
Mondrain OLAP server written in Java
Kettle ETL tool
Weka Machine learning and Data mining tool
04/17/14 10 MaxQDPro: Kettle- ETL Tool
Data is everywhere
Data is inconsistent
Records are different in each system
Performance issues
Running queries to summarize data for stipulated
long period takes operating system for task
Brings the OS on max load
Data is never all in Data Warehouse
Excel sheet, acquisition, new application
04/17/14 11 MaxQDPro: Kettle- ETL Tool
Meta data , model driven approach
What to do? And how to do?
Complex transformation with zero code
Graphically design data transformation and jobs
100% Java with cross-platform support
Extensible architecture
Repository-based
Full featured ETL
Integration with Pentaho Open BI Platform
04/17/14 12 MaxQDPro: Kettle- ETL Tool
Prerequisites Recent Releases
Java Runtime
Environment 1.5 and
above

Compatible with
almost any platform

Compatible with wide
range of Databases
technologies.
4/25 Data Integration 3.0.3 GA

4/18 Data Integration 3.1
Milestone

2/8 Data Integration 3.0.2 GA

12/12 Data Integration 3.0.1 GA

11/15 Data Integration 3.0 GA

10/31 Data Integration 3.0 RC2

10/24 Data Integration 2.5.2 GA

10/08 Data Integration 3.0 RC1

08/24 Data Integration 2.5.1 GA
04/17/14 MaxQDPro: Kettle- ETL Tool 13
Pan
A program to execute transformations designed by Spoon
in XML or database repository.
Transformations are scheduled in batch mode to be run
automatically at regular intervals
Carte
Simple web server to execute transformations and jobs
remotely.
Accept an XML (small servlet) that contains transformation
to execute and the execution configuration.
Allows to remotely monitor, start and stop the
transformations and jobs
Server running in Carte is a Slave Server

04/17/14 MaxQDPro: Kettle- ETL Tool 14
Spoon
GUI that allows you to design transformations and
jobs that can be run with the Kettle tools Pan and
Kitchen
Transformations and Jobs can describe themselves
using an XML file or can be put in a Kettle database
repository.
Spoon is available as executable script and batch file
to make use of tool in heterogeneous environment.
Latest version of Spoon is 3.2 beta version.
Kitchen
Execute jobs designed by Spoon in XML or database
repository


04/17/14 MaxQDPro: Kettle- ETL Tool 15
Create Shortcut with
spoon.ico pointing to bat file
Works on most of OS
Installing
Ensure JRE 1.5 is
installed.
Unzip the binary
distribution in any folder
Launching
spoon.bat in windows
platform
spoon.sh in Unix like
platform
Supported platform
Microsoft Windows
including Vista
Linux GTK: on i386 and
x86_64 processors
Apple's OSX: works both
on PowerPC and Intel
machines
Solaris: using a Motif
interface
AIX, HP-UX, FreeBSD
04/17/14 MaxQDPro: Kettle- ETL Tool 16
Latest JDBC 3.0
JDBC -Database
connectivity Java tool.
Comes in four different
types
Type1: JDBC-ODBC Bridge
Type 2 : Native API partial
Java driver
Type 3 : Middleware Java
Drivers
Type 4: Direct to DB Java
Drivers
Microsoft Based DB like
MS Access rely on Type
1drivers
Oracle, Mysql can be
connected with other
types. But traditionally
used is the Type 4
driver.
JDBC can also operate
in Distributed
environment.
04/17/14 MaxQDPro: Kettle- ETL Tool 17
04/17/14 MaxQDPro: Kettle- ETL Tool 18
04/17/14 MaxQDPro: Kettle- ETL Tool 19
Key Improvement
Execution Results Pane for logs, metrics and
performance graph
Improved Database Connection dialog
Snap to grid (graphical workspace)
Zoom (Graphical Workspace)
Easier to use left panel for the objects palette
Over 30 new or improved Transformation Steps
13 new or improved Job Entries
Support for four new database types - MonetDB,
KingbaseES, Vertica, and HP NeoView
Improved translations
04/17/14 MaxQDPro: Kettle- ETL Tool 20
Repository Connection establishment
Auto login
By setting manually KETTLE_REPOSITORY,
KETTLE_USER and KETTLE_PASSWORD
environmental variables.
Login
By default PDI provides login username and
password ad admin.
It strictly advised to change default password to
avoid any security vulnerablity.
04/17/14 MaxQDPro: Kettle- ETL Tool 21
04/17/14 MaxQDPro: Kettle- ETL Tool 22
04/17/14 MaxQDPro: Kettle- ETL Tool 23
04/17/14 MaxQDPro: Kettle- ETL Tool 24
Transformation
Value: Values are part of a row
and can contain any type of data
Row: a row exists of 0 or more
values
Output stream: an output
stream is a stack of rows that
leaves a step.
Input stream: an input stream is
a stack of rows that enters a
step.
Hop: A hop is a graphical
representation of one or more
data streams between 2 steps.
Note: A note is a piece of
information that can be added
to a transformation
04/17/14 MaxQDPro: Kettle- ETL Tool 25
Engine capable of performing a
multitude of functions such as reading,
manipulating and writing data to and
from various data sources.

Jobs
Job Entry: A job entry is
one part of a job and
performs a certain
Hop: A hop is a graphical
representation of one or
more data streams
between 2 steps
Note: a note is a piece of
information that can be added
to a job
04/17/14 MaxQDPro: Kettle- ETL Tool 26
A way of calling transformations and
controlling the sequence of their
execution. Usually jobs are
scheduled in batch mode to be run
automatically at regular intervals.

Input Steps
Output Steps
Lookup Steps
Transformation
Steps
Join Steps
DW Steps
Mapping Steps
Job Steps
04/17/14 27 MaxQDPro: Kettle- ETL Tool
04/17/14 28 MaxQDPro: Kettle- ETL Tool
04/17/14 29 MaxQDPro: Kettle- ETL Tool
04/17/14 MaxQDPro: Kettle- ETL Tool 30
04/17/14 31 MaxQDPro: Kettle- ETL Tool
04/17/14 32 MaxQDPro: Kettle- ETL Tool
04/17/14 33 MaxQDPro: Kettle- ETL Tool
04/17/14 34 MaxQDPro: Kettle- ETL Tool
Table Output Step
04/17/14 35 MaxQDPro: Kettle- ETL Tool
Insert / Update Output Step
04/17/14 36 MaxQDPro: Kettle- ETL Tool
Besides the execution order, it specifies the condition for next job entry

Unconditional - next job entry will be executed regardless of the result
of the originating job entry.

Follow when result is true - next job entry will only be executed when
the result of the originating job entry is true,

Follow when result is false - next job entry will only be executed when
the result of the originating job entry was false
04/17/14 37 MaxQDPro: Kettle- ETL Tool
04/17/14 38 MaxQDPro: Kettle- ETL Tool
04/17/14 39 MaxQDPro: Kettle- ETL Tool
04/17/14 MaxQDPro: Kettle- ETL Tool 40
Brief Introduction to ETL process
JDBC Repository Connection
Pentaho Data Integration Tool
Components
Pan
Carte
Kitchen
Spoon
Transformation with different Input Data Source
Jobs

04/17/14 MaxQDPro: Kettle- ETL Tool 41
kettle.pentaho.org
Kettle project homepage
kettle.javaforge.com
Kettle community website: forum, source, documentation, tech tips,
samples,
www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey
www.mysql.com
White paper - http://dev.mysql.com/tech-
resources/articles/mysql_5.0_pentaho.html
Kettle Webinar - http://www.mysql.com/news-and-events/on-
demand-webinars/pentaho-2006-09-19.php
Roland Bouman blog on Pentaho Data Integration and MySQL
http://rpbouman.blogspot.com/2006/06/pentaho-data-
integration-kettle-turns.html
04/17/14 42 MaxQDPro: Kettle- ETL Tool

You might also like