
Lecture 4

Dinesh Asanka dinesha@ecollege.com 777777882

Recap

What is ETL
Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

Extracting data from outside sources
Transforming it to fit operational needs (which can include quality levels)
Loading it into the end target (database or data warehouse)
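The three steps above can be sketched as three small functions. This is only a minimal illustration: the CSV text, the `sales` table, and the column names are hypothetical, and an in-memory SQLite database stands in for the end target.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an outside source (a CSV export).
SOURCE_CSV = "id,name,qty,unit_price\n1,Widget,2,3.50\n2,Gadget,1,10.00\n"


def extract(text):
    """Extract: read rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and derive an amount column."""
    return [
        (int(r["id"]), r["name"].strip(), int(r["qty"]) * float(r["unit_price"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

Keeping each step as its own function makes it easy to swap the source or the target without touching the transformation rules.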

Extract
The first part of an ETL process involves extracting the data from the source systems. Sources can be heterogeneous:

Relational databases
Flat files: text files, CSV, TSV, fixed-length, XML
Web
Email
Images, video, and audio files

Staging Area
You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

Data Warehouse Architecture (Basic)

Data Warehouse Architecture (with a Staging Area)

An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure. If it does not, the data may be rejected entirely or in part.
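A check of this kind might look like the sketch below. The field names and the patterns in `EXPECTED` are assumptions made for illustration, and the email pattern is deliberately loose. Rows that fail are kept aside with the names of the failing fields, so they can be rejected in whole or reviewed in part.

```python
import re

# Hypothetical expected structure: id is all digits, email looks like an address.
EXPECTED = {
    "id": re.compile(r"\d+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def check_rows(rows):
    """Split rows into those matching the expected structure and those that do not."""
    accepted, rejected = [], []
    for row in rows:
        # Collect every field whose value is missing or does not match its pattern.
        bad = [
            field
            for field, pattern in EXPECTED.items()
            if not pattern.fullmatch(row.get(field) or "")
        ]
        if bad:
            rejected.append((row, bad))
        else:
            accepted.append(row)
    return accepted, rejected


rows = [
    {"id": "1", "email": "a@example.com"},
    {"id": "x2", "email": "b@example.com"},
    {"id": "3", "email": "not-an-email"},
]
accepted, rejected = check_rows(rows)
```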

Transform
The transform stage applies a series of rules or functions to the data extracted from the source to derive the data for loading into the end target.

Transform Operations
Filtering
    Ignoring columns
    Filtering rows
Translating coded values
    Gender / Status
Derived columns
    Qty * Unit Price = Amount
    Amount % Discount Rate = Discount Amount
String functions
Joining / Union
Sorting
Splitting
Pivoting / Unpivoting
Aggregating
Validating
Cleansing
Profiling
Lookup
Data type conversion
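A few of these operations (filtering rows, ignoring columns, translating coded values, derived columns, sorting) can be sketched in plain Python. The sample rows, the `GENDER_CODES` table, and the column names are hypothetical. The sketch also assumes the discount rate is stored as a fraction, so that discount amount = amount * discount rate.

```python
# Hypothetical code table for translating coded values.
GENDER_CODES = {"M": "Male", "F": "Female"}

rows = [
    {"order_id": 1, "gender": "M", "qty": 2, "unit_price": 5.0,
     "discount_rate": 0.10, "internal_note": "x"},
    {"order_id": 2, "gender": "F", "qty": 0, "unit_price": 8.0,
     "discount_rate": 0.00, "internal_note": "y"},
    {"order_id": 3, "gender": "F", "qty": 4, "unit_price": 3.0,
     "discount_rate": 0.20, "internal_note": "z"},
]


def transform(row):
    """Apply the per-row rules and return only the columns the target needs."""
    amount = row["qty"] * row["unit_price"]  # derived column
    return {
        "order_id": row["order_id"],
        "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # coded value
        "amount": amount,
        # Assumption: the rate is a fraction (0.10 means 10 percent).
        "discount_amount": round(amount * row["discount_rate"], 2),
        # internal_note is deliberately left out (ignoring columns).
    }


# Filtering rows: drop orders with no quantity, then transform and sort.
result = [transform(r) for r in rows if r["qty"] > 0]
result.sort(key=lambda r: r["amount"], reverse=True)
```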

Load
Loading data to the destination database
Auditing
Error files
Report
Email
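One possible shape for a load step with auditing and an error file is sketched below. The `customer` and `load_audit` tables are hypothetical, SQLite stands in for the destination database, and an in-memory buffer stands in for an error file on disk.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the destination database
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE load_audit (loaded INTEGER, rejected INTEGER)")

# Hypothetical transformed rows; two of them violate the table constraints.
rows = [(1, "Ann"), (2, None), (1, "Duplicate"), (3, "Raj")]

error_file = io.StringIO()  # stands in for an error file on disk
errors = csv.writer(error_file)
loaded = 0
rejected = 0

for row in rows:
    try:
        conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError as exc:
        # Write the rejected row and the reason to the error file.
        errors.writerow([*row, str(exc)])
        rejected += 1

# Auditing: record how many rows were loaded and how many were rejected.
conn.execute("INSERT INTO load_audit VALUES (?, ?)", (loaded, rejected))
conn.commit()
```

The audit row and the error file are what a later report or email notification would summarize.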

ETL or ELT
With the advancement in both hardware and data warehouse software technology, warehouse designers can now consider extract, load and transform (ELT) a viable option.

Challenges
Operational problems
Getting incremental data changes
    Triggers
    Replication
    CDC
    Change Tracking
Complexity
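Of the options listed for getting incremental changes, triggers are the easiest to show in a few lines. In the sketch below, triggers on a hypothetical `customer` table write every insert and update to a `customer_changes` table, and the extract reads only the changes made since the last run. SQLite is used for illustration; trigger syntax differs between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        id INTEGER,
        op TEXT
    );
    CREATE TRIGGER customer_ins AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_changes (id, op) VALUES (NEW.id, 'I');
    END;
    CREATE TRIGGER customer_upd AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_changes (id, op) VALUES (NEW.id, 'U');
    END;
    """
)

# Initial data, picked up by a first (full) extract.
conn.execute("INSERT INTO customer VALUES (1, 'Ann')")
conn.execute("INSERT INTO customer VALUES (2, 'Raj')")
last_seen = conn.execute("SELECT MAX(change_id) FROM customer_changes").fetchone()[0]

# Later activity in the operational system.
conn.execute("UPDATE customer SET name = 'Anne' WHERE id = 1")
conn.execute("INSERT INTO customer VALUES (3, 'Mei')")

# Incremental extract: only the changes recorded after the last run.
delta = conn.execute(
    "SELECT id, op FROM customer_changes WHERE change_id > ? ORDER BY change_id",
    (last_seen,),
).fetchall()
```

Triggers add work to every write on the source table, which is one reason the other options in the list exist.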

Document

Target Table Name: dim_customer
Target Table Description: High level information about a customer such as name, customer type and customer status.
Source Table Names: dwprod1.dwstage.crm_cust, dwprod1.dwstage.ord_cust
Join Rules: crm_cust.custid = ord_cust.cust.cust_nbr
Filter Criteria: crm_cust.cust_type not = 7
Additional Logic: N/A
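A mapping document like this translates fairly directly into a query. The sketch below is only an illustration: the column lists and sample rows are hypothetical, SQLite stands in for the staging database (so the dwprod1.dwstage prefix is dropped), and the join rule is read as crm_cust.custid = ord_cust.cust_nbr, which is an assumption about the intended column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the staging database
conn.executescript(
    """
    -- Hypothetical columns; the mapping document names only the keys and cust_type.
    CREATE TABLE crm_cust (custid INTEGER, cust_name TEXT, cust_type INTEGER);
    CREATE TABLE ord_cust (cust_nbr INTEGER, cust_status TEXT);
    INSERT INTO crm_cust VALUES (1, 'Ann', 2), (2, 'Raj', 7), (3, 'Mei', 1);
    INSERT INTO ord_cust VALUES (1, 'Active'), (2, 'Active'), (3, 'Closed');
    """
)

# Join rule and filter criteria taken from the mapping document.
# Assumption: the join column on ord_cust is cust_nbr.
conn.execute(
    """
    CREATE TABLE dim_customer AS
    SELECT c.custid, c.cust_name, c.cust_type, o.cust_status
    FROM crm_cust AS c
    JOIN ord_cust AS o ON c.custid = o.cust_nbr
    WHERE c.cust_type <> 7
    """
)
dim = conn.execute(
    "SELECT custid, cust_name, cust_type, cust_status FROM dim_customer ORDER BY custid"
).fetchall()
```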

Tools
Open-source ETL frameworks
    Apatar
    CloverETL
    Flat File Checker
    Jitterbit 2.0
    Pentaho Data Integration
    RapidMiner
    Scriptella
    Talend Open Studio

Proprietary ETL frameworks
    IBM InfoSphere DataStage
    Informatica PowerCenter
    Oracle Data Integrator (ODI)
    Ab Initio
    Altova MapForce
    Phocas ETL
    Microsoft SQL Server Integration Services

Pentaho / Talend
Any volunteers?
What needs to be covered:
    What these tools offer
    What the limitations are
    A few screenshots of the tools
    5-10 minute presentation with a write-up

References
http://en.wikipedia.org/wiki/Extract,_transform,_load
