Recap
What is ETL
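Extract, transform, and load (ETL) is a process in data warehousing that involves:
Extracting data from outside sources
Transforming it to fit operational needs (which can include quality levels)
Loading it into the end target (database or data warehouse)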
Extract
The first part of an ETL process involves extracting the data from the source systems, such as:
Web
Email
Images, Video, Audio Files
Staging Area
You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.
An intrinsic part of the extraction is parsing the extracted data to check whether it meets an expected pattern or structure. If it does not, the data may be rejected entirely or in part.
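The parsing check described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the three-field record layout (customer id, name, status code) and the names `parse_record` and `extract` are assumptions made for the example.

```python
import re

# Hypothetical expected structure: customer id, name, status code.
EXPECTED_FIELDS = 3
CUSTOMER_ID = re.compile(r"^\d+$")


def parse_record(line):
    """Return a parsed dict, or None if the line does not match the expected structure."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != EXPECTED_FIELDS:
        return None
    cust_id, name, status = fields
    # The id must be all digits and the name must not be empty.
    if not CUSTOMER_ID.match(cust_id) or not name:
        return None
    return {"cust_id": int(cust_id), "name": name, "status": status}


def extract(lines):
    """Split raw lines into accepted records and rejected raw lines."""
    accepted, rejected = [], []
    for line in lines:
        record = parse_record(line)
        if record is None:
            rejected.append(line)  # rejected in part: only this line is dropped
        else:
            accepted.append(record)
    return accepted, rejected
```

Here a malformed line is rejected on its own while the rest of the batch continues. Rejecting the whole batch on the first failure would be the "entirely" variant.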
Transform
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target.
Transform Operations
Filtering: ignoring columns, filtering rows
Translating coded values: Gender / Status
Derived columns: Qty * Unit Price = Amount; Amount * Discount Rate (%) = Discount Amount
String functions
Joining / Union
Sorting
Splitting
Pivoting / Unpivoting
Aggregating
Validating
Cleansing
Profiling
Lookup
Data type conversion
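A few of the operations listed above can be sketched in plain Python. The row layout, the `GENDER_CODES` table and the function name `transform` are hypothetical, and the discount rule is read here as a percentage of the amount, which is an assumption.

```python
# Hypothetical code table for translating coded values.
GENDER_CODES = {"M": "Male", "F": "Female"}


def transform(rows, discount_rate_pct):
    """Apply filtering, code translation, derived columns and sorting to dict rows."""
    out = []
    for row in rows:
        # Filtering rows: skip rows with a non-positive quantity.
        if row["qty"] <= 0:
            continue
        # Derived columns.
        amount = row["qty"] * row["unit_price"]
        discount_amount = amount * discount_rate_pct / 100
        out.append({
            # Ignoring columns: only the columns named here are carried forward.
            "customer": row["customer"].strip().title(),  # string functions
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # coded values
            "amount": amount,
            "discount_amount": discount_amount,
        })
    # Sorting by customer name.
    return sorted(out, key=lambda r: r["customer"])
```

Each output row contains only the four named columns, so any extra source column is dropped without further code.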
Load
Loading Data to Destination Database
Auditing
Error Files
Report
Email
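Loading with basic auditing can be sketched as below, using Python's built-in `sqlite3` module as a stand-in for the destination database. The table `target_customer`, its columns and the function name `load` are hypothetical. A real job would write the rejected records to an error file and send the counts in a report or email.

```python
import sqlite3


def load(conn, records):
    """Insert records one by one; return (loaded_count, rejected) for auditing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_customer ("
        "cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    loaded, rejected = 0, []
    for rec in records:
        try:
            conn.execute(
                "INSERT INTO target_customer (cust_id, name) VALUES (?, ?)",
                (rec["cust_id"], rec["name"]),
            )
            loaded += 1
        except sqlite3.IntegrityError as exc:
            # Keep the failed record with the reason so it can be audited later.
            rejected.append((rec, str(exc)))
    conn.commit()
    return loaded, rejected
```

Row-by-row inserts make it easy to attribute each failure to a record. Bulk loading is usually faster but needs a different way to isolate bad rows.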
ETL or ELT
With the advancement in both hardware and data warehouse software technology, warehouse designers can now consider extract, load and transform (ELT) a viable option.
Challenges
Operational problems
Getting incremental data changes
Complexity
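One common way to get incremental data changes is a "high-water mark": remember the latest modification timestamp already extracted and ask only for newer rows. The sketch below assumes a hypothetical source table `source_orders` with an `updated_at` column stored as ISO 8601 text. It is an illustration of the idea, not the only approach.

```python
import sqlite3


def extract_incremental(conn, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # If nothing changed, keep the old watermark for the next run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```

This approach does not see deleted rows and depends on the source reliably maintaining `updated_at`, which is part of why incremental extraction is listed as a challenge.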
Document
Target table: dim_customer
Description: High-level information about a customer, such as name, customer type and customer status.
Source tables: dwprod1.dwstage.crm_cust, dwprod1.dwstage.ord_cust
Join condition: crm_cust.custid = ord_cust.cust.cust_nbr
Filter: crm_cust.cust_type not = 7
Additional Logic: N/A
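A mapping document like the one above translates fairly directly into a query. The sketch below uses an in-memory style `sqlite3` connection rather than the `dwprod1.dwstage` schema, and it assumes the join column on `ord_cust` is `cust_nbr`. The columns `cust_name` and `cust_status` are hypothetical, since the document does not name them.

```python
import sqlite3

# Join and filter taken from the mapping document; column list is assumed.
DIM_CUSTOMER_QUERY = """
SELECT c.custid, c.cust_name, c.cust_type, o.cust_status
FROM crm_cust AS c
JOIN ord_cust AS o ON c.custid = o.cust_nbr
WHERE c.cust_type <> 7
"""


def build_dim_customer(conn):
    """Rebuild dim_customer from the two staging tables; return its row count."""
    conn.execute("DROP TABLE IF EXISTS dim_customer")
    conn.execute("CREATE TABLE dim_customer AS " + DIM_CUSTOMER_QUERY)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
```

Keeping the query next to the mapping document makes it easier to check that the join and filter in code still match what was documented.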
Tools
Open-source ETL frameworks: Apatar, CloverETL, Flat File Checker, Jitterbit 2.0, Pentaho Data Integration, RapidMiner, Scriptella, Talend Open Studio
Proprietary ETL frameworks: IBM InfoSphere DataStage, Informatica PowerCenter, Oracle Data Integrator (ODI), Ab Initio, Altova MapForce, Phocas ETL, Microsoft SQL Server Integration Services
Pentaho / Talend
Any Volunteers?
What needs to be covered:
What these tools offer
What the limitations are
A few screenshots of the tools
A 5-10 minute presentation with a write-up
References
http://en.wikipedia.org/wiki/Extract,_transform,_load