Copyright
This is a preliminary document and may be changed substantially prior to final
commercial release of the software described herein.
The information contained in this document represents the current view of Microsoft
Corporation on the issues discussed as of the date of publication. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of
any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Microsoft, the furnishing of
this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
© 2008 Microsoft Corporation. All rights reserved.
Microsoft, Excel, and SQL Server are either registered trademarks or trademarks of
Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks
of their respective owners.
Table of Contents
Introduction
Challenges and Product Limitations
  Enterprise Standard
  Developer Productivity
  Data Lineage and Audit Trail
  Scale Out with Commodity Hardware
Usage Scenario of Metadata-Driven ETL
Platform Architecture
Metadata Repository
Builder
  Defining SSIS Control Flow
  Dynamic Generating SSIS Packages
Controller and Worker
  Distributed Execution
  Unified Logging
Monitor
Metadata Editor
ETL Pattern Library
Further Enhancements
  Metadata Repository Manager
  Business Rule Engine
  Data Quality
  Putting It Together
Conclusion
Build a Metadata Driven ETL Platform by Extending Microsoft SQL Server Integration Services
Introduction
Microsoft SQL Server 2008 Integration Services (SSIS) enables enterprise customers
to create, deploy, and manage high-performance data integration solutions. Some of
the most common scenarios of using SSIS are building data warehouses (DW) and
developing business intelligence (BI) solutions. A data warehouse is defined by Bill
Inmon as a subject-oriented, integrated, time-variant, nonvolatile collection of data in
support of management decisions. Extract, transform, and load (ETL) is a crucial
process in data warehousing that involves extracting data from outside sources,
transforming it to fit business needs, and ultimately loading it into the end target,
usually the data warehouse. ETL is an important part of the process of bringing
heterogeneous and asynchronous source extracts into a homogeneous environment.
SQL Server Integration Services can pull data from a wide variety of sources including
relational databases, Microsoft Excel files, XML files, and flat files. It also includes a
rich set of tools and components for developing and executing ETL packages. You can
create SSIS packages manually by using SQL Server Business Intelligence Development
Studio or programmatically by using SSIS APIs.
Although SSIS offers powerful capabilities for building robust ETL solutions, customers
still face many challenges when implementing large-scale data warehouses. This paper
describes how to extend SSIS to a metadata-driven platform to better address those
challenges.
Enterprise Standard
For a large data warehousing system in an enterprise environment, it is important to
standardize ETL processes, including unified logging, checkpoints, and exception
handling. A standard describes precise behaviors or actions. It helps prevent teams
from creating different flavors of SSIS packages that are difficult for other
developers to understand. Standards also help to improve the
productivity of the team, the quality of the application, and the maintainability and
understandability of the system. Although creating SSIS packages based on predefined
templates is a good practice, this paper introduces a comprehensive metadata-driven
approach to enforce enterprise standards.
Developer Productivity
Developers can create and deploy SSIS packages by using SQL Server Business
Intelligence Development Studio, which offers a flexible way to define and execute ETL
tasks. SQL Server Business Intelligence Development Studio is very effective for
designing simple, individual ETL packages. For large data warehousing systems with
hundreds of packages, however, it is labor intensive and error prone to develop, test,
deploy, and maintain that many SSIS packages for data acquisition, integration,
and distribution. A cost-effective alternative is to enable developers to define ETL
processes using metadata definitions without the need to worry about common tasks
such as logging and exception handling. ETL patterns are custom implementations of
Microsoft Corporation 2008
To address these challenges, this paper presents the design of a metadata-driven
ETL platform that focuses on optimizing the acquisition, integration, and
distribution needs of enterprise data warehouses. The platform is
complementary to SQL Server Integration Services. It helps reduce the total
cost of ownership of large enterprise data warehouse systems and BI solutions.
1. The ETL developer defines the source and destination of the extraction process,
including servers, relational databases, and tables.
2. For each table, the schema (columns, index, and constraints) can either be
automatically retrieved by the system or manually specified by the ETL developer.
3. The ETL developer specifies the mapping between source and destination at the
database, table, and column level.
4. The ETL developer selects either a full or delta load.
5. The ETL developer configures an orchestration process. For each table, ETL
developers define one or multiple steps for performing the extraction.
6. The ETL developer specifies how ETL packages should be executed: at a scheduled
time or on demand.
7. The system dynamically generates one or more SSIS packages.
8. The system deploys the SSIS packages to a distributed execution environment.
9. The system executes ETL jobs and captures the job status.
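Steps 1 through 4 above could be captured as plain metadata records from which the system derives the extraction logic. The following Python sketch is illustrative only and is not the platform's actual implementation. The TableMapping class, its field names, the sample table names, and the use of a watermark column to drive delta loads are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TableMapping:
    """Hypothetical metadata record for one source-to-destination table mapping."""
    source_table: str
    destination_table: str
    # Maps each source column name to its destination column name.
    columns: Dict[str, str]
    load_type: str = "full"                  # "full" or "delta"
    watermark_column: Optional[str] = None   # assumed to be required for delta loads


def build_extract_query(mapping: TableMapping) -> str:
    """Build the extraction SELECT statement implied by the metadata."""
    if mapping.load_type not in ("full", "delta"):
        raise ValueError(f"unknown load type: {mapping.load_type}")
    select_list = ", ".join(
        f"{src} AS {dst}" for src, dst in mapping.columns.items()
    )
    query = f"SELECT {select_list} FROM {mapping.source_table}"
    if mapping.load_type == "delta":
        if not mapping.watermark_column:
            raise ValueError("a delta load needs a watermark column")
        # The "?" placeholder would be bound at run time to the last loaded value.
        query += f" WHERE {mapping.watermark_column} > ?"
    return query


# Hypothetical example: a delta load of an orders table.
orders = TableMapping(
    source_table="sales.Orders",
    destination_table="dw.FactOrders",
    columns={"OrderID": "OrderKey", "Amount": "OrderAmount"},
    load_type="delta",
    watermark_column="ModifiedDate",
)
print(build_extract_query(orders))
```

Because the load type is just a field in the metadata, switching a table from a full load to a delta load is a data change rather than a package redesign.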
The outcomes of the usage scenario are:
SSIS packages are executed. Data is moved from source systems to destinations.
System environmental parameters, such as server name, data source name, folder
location, and connection parameters
Logging parameters
Platform Architecture
The design goals of the platform include improving the productivity of developers,
enforcing ETL standards, supporting a cost-effective way to deploy large data
warehouses on commodity hardware, and providing a centralized metadata repository for
lineage tracking. The solution architecture of the platform is shown in Figure 1. The
intent of this paper is not to document all the implementation details. Rather, it
describes the concepts and key components of the platform and how they are
connected with SQL Server Integration Services. For a more detailed architecture
diagram, see Figure 6 at the end of this paper.
Metadata repository. Stores the ETL definitions, which describe the data sources,
destinations, data mappings, data transformations, and orchestration processes
Metadata editor. Used for managing metadata via a graphical user interface
Logging repository. Stores status data for building and executing SSIS packages
The following sections describe each component in detail. With an open architecture
centered on the metadata repository, this platform can be further extended.
Metadata Repository
The metadata repository is used to store technical and business metadata, including but
not limited to data source and destination definition, data movement and pattern
definition, and orchestration process definition. It can be further extended to include
business rule definition, data quality metric definition, and so on. Figure 2 shows a
sample metadata model [1]. Note that this model is for illustration purposes only and does
not necessarily reflect the actual data model we implemented. In the example:
Data package defines the data source and destination entities to be used in the ETL
process. Data packages can be implemented as a hierarchy and include data groups
and data elements. For relational data sources, a database is a type of data package, a
table is a type of data group, and a column is a type of data element.
[1] The data model is based on the book Universal Meta Data Models by David
Marco and Michael Jennings.
Data movement defines the mapping between the source and destination. The mapping
can be done at multiple levels. For relational databases, it can be at the database,
table, and column level.
Data transformation defines the transformation rules that will be applied to the data
movement. Transformation rules can be implemented using stored procedures and
managed code. Reusable code can be saved as ETL patterns.
In addition, other metadata is required by the platform. For example, data
orchestration defines how the ETL jobs will be executed and the precedence of tasks.
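The package, group, and element hierarchy and the data movement mapping described above can be sketched as simple record types. This Python sketch is an illustration under stated assumptions, not the data model the authors implemented. The class and field names, the dotted path convention, the sample CRM schema, and the unmapped_elements helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataElement:
    """A column, in the relational case."""
    name: str
    data_type: str


@dataclass
class DataGroup:
    """A table, in the relational case."""
    name: str
    elements: List[DataElement] = field(default_factory=list)


@dataclass
class DataPackage:
    """A database, in the relational case."""
    name: str
    groups: List[DataGroup] = field(default_factory=list)

    def element_paths(self) -> List[str]:
        """List every element as package.group.element."""
        return [
            f"{self.name}.{group.name}.{element.name}"
            for group in self.groups
            for element in group.elements
        ]


@dataclass
class DataMovement:
    """Maps one source element path to one destination element path."""
    source_path: str
    destination_path: str


def unmapped_elements(source: DataPackage, movements: List[DataMovement]) -> List[str]:
    """Return the source elements that no data movement refers to."""
    mapped = {m.source_path for m in movements}
    return [path for path in source.element_paths() if path not in mapped]


# Hypothetical source database with one table and two columns.
crm = DataPackage("CRM", [
    DataGroup("Customer", [
        DataElement("CustomerID", "int"),
        DataElement("Name", "nvarchar(100)"),
    ]),
])
movements = [DataMovement("CRM.Customer.CustomerID", "DW.DimCustomer.CustomerKey")]
print(unmapped_elements(crm, movements))
```

Keeping the mapping as data makes checks such as "which source columns are not yet mapped" a simple query over the repository.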
Builder
The builder is designed to automatically generate SSIS packages and instances of
custom code based on metadata definitions. Before the release of SSIS, many
organizations developed their own custom code to perform ETL. SSIS makes it
possible for organizations to leverage their existing investments in data integration and
take full advantage of features such as unified logging and distributed execution.
Standards and best practices are enforced by the builder.
customers, other customers may place more emphasis on standardized processes. For
enterprise customers, enforcing standards not only improves the productivity of
developers, but also helps reduce maintenance and operational costs. Instead of using
SQL Server Business Intelligence Development Studio to design SSIS control flows, we
provided an orchestration template as an abstraction layer. With this template, we can
enforce standards and apply best practices behind the scenes. This is especially useful for
large data warehousing development.
In the orchestration template, ETL processes are defined within a simple hierarchy,
consisting of systems, processes, and tasks. A task is represented as a unit of running
code, such as the execution of a stored procedure or an SSIS data flow task. Processes
are simply a means of defining groups of tasks. The hierarchy can be described in
outline form:
System
    SystemVariable
    SystemConnection
    Process
        ProcessPrecedent
        Task
            TaskPrecedent
A system is analogous to an SSIS package, a process is analogous to an SSIS sequence
container, and a task is analogous to an SSIS task object. Figure 3 shows an example
of how an ETL package can be defined by using the orchestration template.
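The precedent entries in the orchestration template imply an execution order. The following Python sketch shows one way such an order could be derived from task precedents by using a topological sort from the standard library. It is an assumption about how precedence might be resolved, not the platform's documented behavior, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical orchestration metadata: each task lists its precedents.
task_precedents = {
    "ExtractCustomers": [],
    "ExtractOrders": [],
    "LoadDimCustomer": ["ExtractCustomers"],
    "LoadFactOrders": ["ExtractOrders", "LoadDimCustomer"],
}


def execution_order(precedents):
    """Return one valid run order in which every task follows its precedents.

    Raises CycleError if the precedents form a cycle.
    """
    return list(TopologicalSorter(precedents).static_order())


order = execution_order(task_precedents)
print(order)
```

A cycle in the metadata, such as two tasks that name each other as precedents, surfaces as a CycleError before any package is generated.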
Microsoft.SqlServer.Dts.Runtime
    Application
    Connections
    ConnectionManager
    DtsContainer
    Executable
    LogProviderBase
    Package
    PrecedenceConstraint
    Sequence
    Task
    TaskHost
    Variable
Microsoft.SqlServer.Dts.Tasks.ExecuteSQLTask
On the Microsoft Developer Network (MSDN), you can find documentation and sample
code that shows how to programmatically create SSIS packages. For more information,
see Building Packages Programmatically.
Figure 4 shows a portion of the SSIS package generated for the example described in
Figure 3. Because even in simple scenarios an SSIS package with proper exception
handling and logging can be complex, it would not be easy to manually create the
package from scratch by using SQL Server Business Intelligence Development Studio.
Generating SSIS packages based on metadata definitions automates numerous
repetitive housekeeping tasks, reduces the risk of errors, and enhances developer
productivity.
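The idea of automating housekeeping can be sketched independently of the SSIS object model. In the following Python sketch, a generator expands each task definition into the same logging and error-handling steps, so the standard is applied by the generator rather than by hand. This is a conceptual illustration only: it does not call the SSIS API, and the step names (LogPackageStart, LogTaskStart, LogErrorAndFail, and so on) and the dictionary layout are hypothetical.

```python
def build_package(system_name, tasks):
    """Expand task metadata into an ordered list of package steps.

    Every task is wrapped with identical logging steps and the same
    error-handling policy.
    """
    steps = [{"step": "LogPackageStart", "target": system_name}]
    for task in tasks:
        steps.append({"step": "LogTaskStart", "target": task["name"]})
        steps.append({
            "step": task["type"],             # for example "ExecuteSQL"
            "target": task["name"],
            "on_error": "LogErrorAndFail",    # uniform exception handling
        })
        steps.append({"step": "LogTaskEnd", "target": task["name"]})
    steps.append({"step": "LogPackageEnd", "target": system_name})
    return steps


# Hypothetical task metadata for a two-task system.
tasks = [
    {"name": "ExtractOrders", "type": "ExecuteSQL"},
    {"name": "LoadFactOrders", "type": "DataFlow"},
]
package = build_package("SalesDW", tasks)
for step in package:
    print(step["step"], step["target"])
```

Two task definitions expand into eight steps here, which hints at why hand-building hundreds of packages with consistent logging is error prone.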
Distributed Execution
In the architecture to support distributed execution, ETL operations can occur on one of
many worker servers, while process-defining metadata resides on a centralized
controller server. The package executed on the worker server sends progress reports
back to the controller and writes events to the standard logging system. The controller
and worker architecture is shown in Figure 5.
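The controller and worker relationship can be approximated in a single process. In the following Python sketch, threads stand in for worker servers and in-memory queues stand in for the network: workers pull jobs from the controller and send progress reports back. The function name, the report format, and the use of threads are assumptions for illustration. A real worker would execute an SSIS package where the comment indicates.

```python
import queue
import threading


def run_distributed(jobs, worker_count=2):
    """Run jobs on worker threads and return the reports the controller received."""
    job_queue = queue.Queue()
    reports = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    def worker(worker_name):
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            reports.put((worker_name, job, "started"))
            # A real worker would execute the SSIS package here.
            reports.put((worker_name, job, "succeeded"))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # All workers have finished, so the report queue is complete.
    collected = []
    while not reports.empty():
        collected.append(reports.get())
    return collected


for report in run_distributed(["LoadCustomers", "LoadOrders", "LoadProducts"]):
    print(report)
```

Which worker picks up which job varies from run to run, but each job is reported as started before it is reported as succeeded.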
Unified Logging
Logging in this platform consists of a client and a central component. A client logging
component runs on the worker server to collect and manage logging events. This
component uses SQL Server Service Broker to define a logging conversation and send
messages from the worker server to the centralized logging repository. The ETL process
produces common output that provides the ability to review the status of ETL jobs. In
addition, an SSIS log provider and a logging interface for custom code are provided to
enable all messages, including the SSIS log, platform run-time messages, and custom
code events, to be written to the same logging stream.
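The single logging stream can be illustrated with Python's standard logging module. In this sketch, an in-memory queue stands in for the Service Broker conversation and a list-backed handler stands in for the centralized logging repository. Both substitutions, the ListHandler class, and the logger names are assumptions for illustration, not the platform's implementation.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for the central logging repository: keeps records in memory."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.name, record.levelname, record.getMessage()))


log_queue = queue.Queue()   # stand-in for the messaging channel
repository = ListHandler()
listener = logging.handlers.QueueListener(log_queue, repository)
listener.start()

# Three hypothetical event sources all write to the same queue.
for source in ("ssis", "platform", "custom_code"):
    logger = logging.getLogger(f"etl.{source}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))

logging.getLogger("etl.ssis").info("package started")
logging.getLogger("etl.platform").info("job dispatched to worker")
logging.getLogger("etl.custom_code").warning("3 rows rejected")

listener.stop()  # processes any queued records before returning
for entry in repository.records:
    print(entry)
```

Each source keeps its own logger name, so the central store can still distinguish SSIS events from platform and custom code events after they are merged.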
Monitor
The monitor provides a consistent and user-friendly interface for reporting on system
status. The tool can report the current status and historical statistics in an easily
consumable format. The monitoring tool improves the customer experience and
saves debugging and troubleshooting time. The monitoring tool can be further
enhanced with proactive notification and integration with Microsoft System Center
Operations Manager (formerly named Microsoft Operations Manager).
Metadata Editor
Currently, developers use templates stored in a Microsoft Excel file for data entry. An
import wizard is used to load metadata definitions from Excel worksheets to a central
metadata repository. Further improvements to the ETL platform may include tools to
receive metadata (tables, columns, and relationships) from data modeling tools. A
metadata editor is planned but not yet implemented to capture and maintain metadata
definitions for ETL. It will provide a friendly user interface for creating, retrieving,
updating and deleting metadata.
Note: Not all metadata must be captured manually. The metadata editor should be
capable of performing scans and discovering source database schemas in order to
build initial target schema as well as to uncover source system changes that may
impact ETL jobs.
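Uncovering source system changes amounts to comparing the schema stored in the metadata repository with the schema discovered by a scan. The following Python sketch shows that comparison for one table. The diff_schema function, the snapshot format (a mapping of column name to data type), and the sample columns are hypothetical.

```python
def diff_schema(stored, discovered):
    """Compare two {column_name: data_type} snapshots of one table."""
    added = sorted(set(discovered) - set(stored))
    removed = sorted(set(stored) - set(discovered))
    changed = sorted(
        name for name in set(stored) & set(discovered)
        if stored[name] != discovered[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots: the repository copy and a fresh scan of the source.
stored = {"CustomerID": "int", "Name": "nvarchar(50)", "Fax": "varchar(20)"}
discovered = {"CustomerID": "int", "Name": "nvarchar(100)", "Email": "nvarchar(256)"}
print(diff_schema(stored, discovered))
```

A removed or changed column is the kind of result that would flag the affected ETL jobs for review before their next run.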
Further Enhancements
The ETL platform can be further extended to include the features described in the
sections below.
Data Quality
Data quality metrics will be used to measure the completeness, validity, consistency,
and accuracy of data. For example, data quality metrics can ensure that nonconforming data is flagged as an error, loaded to an error staging area, and not loaded
to the data warehouse until corrected. A successful data quality management strategy
involves three key tasks: profiling, cleansing, and auditing. The metadata repository
can be further extended to store metadata about data quality.
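Flagging non-conforming data and holding it back from the warehouse can be sketched as a routing step driven by named rules. In this Python sketch, the route_rows function, the rule names, and the sample rows are hypothetical. In the platform, such rules would presumably be read from the metadata repository rather than defined inline.

```python
def route_rows(rows, rules):
    """Split rows into (clean, rejected) by using named validation rules."""
    clean, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            # Rejected rows carry the names of the rules they failed,
            # ready to be written to an error staging area.
            rejected.append({"row": row, "failed_rules": failed})
        else:
            clean.append(row)
    return clean, rejected


# Hypothetical completeness and validity rules for an amount column.
rules = {
    "amount_present": lambda r: r.get("amount") is not None,
    "amount_non_negative": lambda r: r.get("amount") is None or r["amount"] >= 0,
}
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -5.0},
]
clean, rejected = route_rows(rows, rules)
print(len(clean), "clean;", len(rejected), "rejected")
```

Recording which rule each rejected row failed also gives the auditing task something to count per load.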
Putting It Together
A proposed system architecture, with future enhancements, is shown in Figure 6. It
highlights major components of the platform as well as how the platform is integrated
with SSIS.
[Figure 6. Proposed system architecture. Components shown: Designer, Metadata Editor,
Metadata Repository (ETL Process Metadata, Package Metadata, Schema Metadata, Business
Rules), Metadata Repository Manager, Builder (SSIS Package Generator, Package Deployer),
BizRule Engine, Load Balancer, Controller/Worker, SSIS Runtime (Packages, Task
Container, Tasks, SSIS Data Flow Components, ETL Pattern Instance, SSIS Adapter,
Custom Data Adapters), Logger, Logging Repository (Log Entries, Status), Monitor,
Historical Data Analyzer.]
Conclusion
Microsoft SQL Server 2008 Integration Services (SSIS) offers strong capabilities for
high-performance ETL and is a cost-effective product for developing enterprise data
warehouse solutions. By standardizing ETL processes, improving developer productivity, and
reducing operational cost, the metadata-driven ETL platform built on top of SSIS
described in this paper enables enterprise customers to reduce the total cost of
ownership of data warehousing systems.