
MSBI stands for Microsoft Business Intelligence.

This suite is composed of tools that help in providing the best solutions for Business Intelligence queries. These tools use Visual Studio along with SQL Server. MSBI empowers users to gain access to accurate, up-to-date information for better decision making in an organization. It offers different tools for the different processes required in Business Intelligence (BI) solutions. MSBI is divided into 3 categories:-

1. SSIS - SQL Server Integration Services.
2. SSAS - SQL Server Analysis Services.
3. SSRS - SQL Server Reporting Services.
A visual always helps to understand a concept better. The diagram below broadly defines
Microsoft Business Intelligence (MSBI).

Let's understand this picture by taking the example of an organization - Calvin Klein (CK). Calvin Klein has outlets in most parts of India. Every outlet stores its customer data in its own database, and it is not mandatory that every outlet uses the same database. Some outlets may have Sybase as their database, some might be using Oracle, and some stores prefer storing their data in simple text files. Before proceeding with the explanation, we should know what OLTP is. It stands for Online Transaction Processing. These are the online transactions (Insert, Update, Delete) performed on the database at every outlet as customers shop. After the daily data of the customers who visited each Calvin Klein outlet is stored, the data is integrated and saved in a centralized database. This is done with the help of the Integration Services component of MS SQL Server. Integration means merging data from heterogeneous data stores (text files, spreadsheets, mainframes, Oracle, etc.), refreshing data in data warehouses, and cleansing data before loading to remove errors (for example, the date format may differ across outlet databases, so a single format is applied to make it consistent). Now, you should be clear about the integration concept. This is our Phase 1 - SSIS. The next step is to analyze the stored centralized data. This huge volume of data is divided into data marts on which the analytic process is carried out. Analysis Services uses the OLAP (Online Analytical Processing) component and data mining capabilities. It allows you to build multi-dimensional structures called CUBES to pre-calculate and store complex aggregations, and also to build mining models to perform data analysis, which helps to identify valuable information such as recent trends, patterns and customer dislikes. Business analysts then perform data mining on the multi-dimensional cube structure to look at data from different perspectives. Multi-dimensional analysis of this huge data completes Phase 2 - SSAS. Now the only thing left is to represent this analysis graphically so that the organization (Calvin Klein) can make effective decisions to enhance revenue, gain maximum profit and reduce wasted time. This is done in the form of reports, scorecards, plans, dashboards, Excel workbooks, etc. These reports tell the organization what the revenue of Calvin Klein was at a specific time and place, where they captured the market, where they are lacking and need to improve, and many other things the end users wish to look into. This reporting is done with SQL Server Reporting Services and completes Phase 3 - SSRS.
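To make the OLTP side of this example concrete, the sketch below shows the kind of transactional statements an individual outlet's system would run against its own database. The table and column names (OutletSales and so on) are hypothetical, invented only for illustration.

-- A single customer purchase recorded at one outlet (hypothetical OutletSales table)
INSERT INTO dbo.OutletSales (SaleID, CustomerID, ProductCode, Quantity, SaleDate)
VALUES (1001, 57, 'CK-SHIRT-M', 2, GETDATE());

-- Correcting the quantity of an existing sale
UPDATE dbo.OutletSales SET Quantity = 1 WHERE SaleID = 1001;

-- Cancelling the sale altogether
DELETE FROM dbo.OutletSales WHERE SaleID = 1001;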

What is Business Intelligence?


BI (Business Intelligence) refers to a set of techniques which help in spotting, digging out and analyzing the best information out of huge data to enhance decision making. Let's go into the depth of this concept with an example. Example: Let's take a basic example to understand how business intelligence can be beneficial for an organization. Suppose we have 3-6 months of historical data from a shopping mart. In the data we have different products with their respective product specifications. Let's choose one of the products - say candles. We have three types of candles in this category: Candle A, Candle B and Candle C. On mining this data we come to know that the sale of Candle C was the maximum of these three categories. Digging into this data again, we get the result that the sale of this candle was maximum between 9 am and 11 am. On further analysis, we come to the conclusion that this particular candle is the one used in church. Now, let's apply business intelligence to this analysis: what a business person or firm can do is get other material that can be used in church and keep it in the vicinity of those candles. Customers coming to buy the candles for church can then also have a look at the other church materials and may be tempted to buy them as well. This will definitely enhance sales and hence the revenue of the business. Benefits of BI :- Making your business intelligent will always help in every field, whether saving time, increasing revenue, forecasting or making a profit. There are endless benefits of BI; some of them are listed below :-

Helps in providing more accurate historical data by eliminating guess work. As analysis is mainly done on a huge volume of data, accurate historical data makes sure that we get the correct result.

We can analyse customer behaviour and taste (i.e. what he thinks, what he likes the most, what he hates, etc.), which can enhance your business and decision-making power.

We can easily see where our customers need more attention and where we dominate the market in satisfying clients' needs.

Complex business queries are solved with a single click and at a faster rate, which saves lots of time.

Improve efficiency using forecasting. You can analyse data to see where your business has been, where it is now and where it is going.

Steps involved in a BI end-to-end solution are:

Integration of data from different data stores using ETL, on which analysis is to be done.

Loaded data is then analyzed for BI engagement.

Representation of the analyzed results in the form of reports, scorecards, dashboards, etc.

Business Intelligence Structure:-

Don't panic after looking at these complex words. This explains the meaning of Business Intelligence to a large extent.

SSIS

SSIS stands for SQL Server Integration Services. It is a platform for data integration and workflow applications. It can perform operations like data migration and ETL (Extract, Transform and Load).

E - Merging of data from heterogeneous data stores (i.e. it may be a text file, spreadsheets, mainframes, Oracle, etc.). This process is known as EXTRACTION.

T - Refreshing data in the data warehouses and data marts. Also used to cleanse data before loading to remove errors. This process is known as TRANSFORMATION.

L - High-speed load of data into Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) databases. This process is known as LOADING.
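As a rough illustration of the T (transformation) step, the T-SQL below standardizes a date column that arrives as dd/mm/yyyy text before the rows are loaded into the warehouse. The staging and warehouse table names are assumptions for this sketch; in a real package this cleansing would normally be done by SSIS transformations rather than hand-written SQL.

-- Cleanse and load: convert text dates (dd/mm/yyyy) to DATETIME while copying
-- from a hypothetical staging table into a hypothetical warehouse table.
INSERT INTO dw.OutletSales (OutletID, SaleAmount, SaleDate)
SELECT OutletID,
       SaleAmount,
       CONVERT(datetime, SaleDateText, 103)   -- style 103 = dd/mm/yyyy
FROM staging.OutletSales;                     -- rows that fail conversion would be routed to an error output in SSIS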

Tools used for the development of SSIS projects are -

1. BIDS (Business Intelligence Development Studio).


2. SSMS (SQL Server Management Studio).
Note: Prior to SSIS, the same task was performed with DTS (Data Transformation Services) in SQL Server 2000, but with fewer features. The difference between DTS and SSIS is as follows:

DTS:-

1. Limited error Handling.


2. Message Boxes in ActiveX Scripts.
3. No deployment wizard and BI functionality.

SSIS :-

1. Complex and powerful error handling.


2. Message Boxes in .NET Scripting.
3. Interactive deployment wizard and Complete BI functionality.
To develop your SSIS package, you need to install Business Intelligence Development Studio (BIDS), which is available as a client tool after installing SQL Server Management Studio (SSMS).

BIDS: It is the tool used to develop SSIS packages. It is available with SQL Server as an interface which allows developers to work on the control flow of the package step by step.

SSMS: It provides different options to create an SSIS package, such as the Import/Export wizard. With this wizard, we can create a structure for how the data flow should happen. The created package can then be deployed as per the requirement.

Now, you must be scratching your head wondering about Data Flow and Control Flow. Data flow means extracting data into the server's memory, transforming it and writing it out to an alternative destination, whereas control flow means a set of instructions which tell the program executor how to execute tasks and containers within the SSIS packages. All these concepts are explained in the SSIS architecture.

SSIS Architecture:

1. Packages - A package is a collection of tasks framed together with precedence constraints to manage and execute tasks in an order. It is compiled into an XML-structured file with a .dtsx extension.
2. Control Flow - It acts as the brain of a package. It consists of one or more tasks and containers that execute when the package runs. The control flow orchestrates the order of execution for all its components.
3. Tasks - A task can best be explained as an individual unit of work.
4. Precedence Constraints - These are the arrows in the control flow of a package that connect the tasks together and manage the order in which the tasks will execute. In the data flow, these arrows are known as data paths.
5. Containers - Core units in the SSIS architecture for grouping tasks together logically into
units of work are known as Containers.
6. Connection Managers - Connection managers are used to centralize connection strings to
data sources and to abstract them from the SSIS packages. Multiple tasks can share the
same Connection manager.
7. Data Flow - The core strength of SSIS is its capability to extract data into the servers
memory (Extraction), transform it (Transformation) and write it out to an alternative
destination (Loading).
8. Sources - A source is a component that you add to the Data Flow design surface to specify
the location of the source data.
9. Transformations - Transformations are key components within the Data Flow that allow
changes to the data within the data pipeline.
10. Destinations - Inside the Data Flow, destinations consume the data after the data pipe
leaves the last transformation components.
11. Variables - Variables can be set to evaluate to an expression at runtime.
12. Parameters - Parameters behave much like variables but with a few main exceptions.
13. Event Handlers - Event handlers run in response to the run-time events that packages, tasks, and containers raise.
14. Log Providers - Log providers record package run-time information, such as the start time and the stop time of the package and its tasks and containers (a sample query against the SQL Server log table follows this overview).

15. Package Configurations - After developing your package and before deploying it from UAT to the production environment, you need to perform certain package configurations as per the production server.
This completes the basics of SSIS and its architecture.
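To give a feel for what a log provider captures, the query below is a sketch against the table that the SQL Server log provider writes to (dbo.sysssislog on SQL Server 2008, dbo.sysdtslog90 on SQL Server 2005); adjust the table name to your version.

-- Recent package start/end and error events recorded by the SQL Server log provider
SELECT source, event, starttime, endtime, message
FROM dbo.sysssislog                  -- use dbo.sysdtslog90 on SQL Server 2005
WHERE event IN ('PackageStart', 'PackageEnd', 'OnError')
ORDER BY starttime DESC;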

SSIS Architecture
Microsoft SQL Server Integration Services (SSIS) consists of four key parts:

SSIS Service
SSIS Object Model

SSIS runtime engine and the runtime executables

SSIS dataflow engine and the dataflow components

Integration Services Service

Monitors running Integration Services packages and manages the storage of packages.

Integration Services object model

Includes native and managed application programming interfaces (APIs) for accessing Integration Services tools, command-line utilities, and custom applications.

SSIS Run-time Engine & executables

Runs packages
Supports logging, debugging, configuration, connections, & transactions

SSIS Run-time executables

Packages, Containers, Tasks and Event Handlers

SSIS Data-flow Engine & components

Provides in-memory buffers to move data
Calls source adapters to read from files & DBs
Provides transformations to modify data
Provides destination adapters to load data into data stores

Components

Source adapters, destination adapters & transformations

SQL Server Business Intelligence Development Studio
SQL Server Business Intelligence Development Studio (BIDS) allows users to create /
edit SSIS packages using a drag-and-drop user interface. BIDS is very user friendly and
allows you to drag-and-drop functionalities. There are a variety of elements that define
a workflow in a single package. Upon package execution, the tool provides color-coded,
real-time monitoring.

Components of SSIS Package include

Control Flow
Data Flow

Control Flow

Control flow deals with the orderly processing of tasks, which are individual, isolated units of work that perform a specific action and end with a finite outcome (one that can be evaluated as Success, Failure, or Completion). Their sequence can be customized by linking them into arbitrary arrangements with precedence constraints and by grouping them together or repeating their execution in a loop with the help of containers, but a subsequent task does not start until its predecessor has completed.

Elements of Control Flow include

Container
Containers provide structure in packages and services to tasks in the control flow. Integration Services includes the following container types for grouping tasks and implementing repeating control flows:

The Foreach Loop container: It enumerates a collection and repeats its control flow for each member of the collection. The Foreach Loop container is for situations where you have a collection of items and wish to use each item within it as some kind of input into the downstream flow.

The For Loop container: It is a basic container that provides looping functionality. A For loop contains a counter that usually increments (though it sometimes decrements), and on each pass a comparison is made with a constant value. If the condition evaluates to True, the loop execution continues (a rough T-SQL analogue follows this list of containers).

The Sequence container: A special kind of container that, both conceptually and physically, can hold any other type of container or control flow component. It is also called a "container container", or super container.
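For readers more familiar with T-SQL than with the designer, the For Loop container behaves much like a WHILE loop: an initial counter value, a test evaluated before each pass, and an increment at the end of each pass. The sketch below is only an analogy; a real For Loop container is configured through its InitExpression, EvalExpression and AssignExpression properties rather than written as code.

DECLARE @i INT = 1;                -- InitExpression: starting value of the counter
WHILE @i <= 5                      -- EvalExpression: the loop continues while this is true
BEGIN
    PRINT 'Iteration ' + CAST(@i AS varchar(10));   -- the work done on each pass
    SET @i = @i + 1;               -- AssignExpression: increment the counter
END;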

Tasks
Tasks do the work in packages. Integration Services includes tasks for performing a variety of functions.

The Data Flow task: Defines and runs data flows that extract data, apply transformations, and load data.

Data preparation tasks: Copy files and directories, download files and data, save data returned by Web methods, or work with XML documents.

Workflow tasks: Communicate with other processes to run packages or programs, send and receive messages between packages, send e-mail messages, read Windows Management Instrumentation (WMI) data, or watch for WMI events.

SQL Server tasks: Access, copy, insert, delete, or modify SQL Server objects and data.

Analysis Services tasks: Create, modify, delete, or process Analysis Services objects.

Scripting tasks: Extend package functionality through custom scripts.

Maintenance tasks: Perform administrative functions, such as backing up and shrinking SQL Server databases, rebuilding and reorganizing indexes, and running SQL Server Agent jobs.

Precedence constraints
Precedence constraints connect containers and tasks in packages into an ordered control flow. You can control the sequence of execution for tasks and containers, and specify conditions that determine whether tasks and containers run.

Data Flow
The data flow carries out its processing responsibilities by employing the pipeline paradigm, carrying data record by record from its source to a destination and modifying it in transit by applying transformations. (There are exceptions to this rule, since some components, such as Sort or Aggregate, require the ability to view the entire data set before handing it over to their downstream counterparts.) The items used to create a data flow are categorized into three parts.

Elements of Data Flow

Elements of the Data Flow are categorized into three parts (a simple T-SQL analogy of these three roles is sketched after this list):

1. Data Flow Sources: These elements are used to read data from different types of sources (SQL Server, Excel sheets, etc.).

2. Data Flow Transformations: These elements are used to process data (cleansing, adding new columns, etc.).

3. Data Flow Destinations: These elements are used to save processed data to the desired destination (SQL Server, Excel sheets, etc.).
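As a loose analogy (the table names here are invented only for illustration), a single INSERT...SELECT statement contains the same three roles: the FROM clause is the source, the expressions in the SELECT list are the transformation, and the INSERT target is the destination.

-- Destination: the table being loaded
INSERT INTO dbo.CustomerStage (CustomerID, FullName, LoadedAt)
-- Transformation: derive a full name and a load timestamp
SELECT CustomerID,
       FirstName + ' ' + LastName,
       GETDATE()
-- Source: where the rows come from
FROM dbo.Customer;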

Data Flow Source

The different items which can read the various types of source data are listed below:

DataReader Source: The DataReader source uses an ADO.NET connection manager to read data from a DataReader and channel it into the Data Flow.

Excel Source: The Excel source connects to an Excel file and, selecting content based on a number of configurable settings, supplies the Data Flow with data. The Excel source uses the Excel connection manager to connect to the Excel file.

Flat File Source: Flat-file formats, which include CSV and fixed-width columns, are still popular. For many reasons, individual circumstances can dictate the use of CSV files over other formats, which is why the Flat File source remains a popular Data Flow data source.

OLE DB Source: The OLE DB source is used when data access is performed via an OLE DB provider. It is a fairly simple data source type, and everyone is familiar with OLE DB connections.

Raw File Source: The Raw File source is used to import data that is stored in the SQL Server raw file format. It is a rapid way to import data that has perhaps been output by a previous package in the raw format.

XML Source: The XML source requires an XML Schema Definition (XSD) file, which is really the most important part of the component because it describes how SSIS should handle the XML document.
Data Flow Transformation
Items in this category are used to perform different operations to get the data into the desired format.

Aggregate: The Aggregate transformation component essentially encapsulates a number of aggregate functions as part of the Data Flow, such as Count, Count Distinct, Sum, Average, Minimum, Maximum and Group By with respect to one or more columns.
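The Aggregate transformation produces the same kind of result set as a GROUP BY query. The query below, against the AdventureWorks Sales.SalesOrderDetail table already used later in this material, shows the equivalent T-SQL; in SSIS the same grouping and aggregate functions are simply selected in the transformation editor.

SELECT ProductID,                       -- Group By column
       COUNT(*)       AS OrderLines,    -- Count
       SUM(LineTotal) AS TotalSales,    -- Sum
       AVG(LineTotal) AS AvgLine,       -- Average
       MIN(LineTotal) AS MinLine,       -- Minimum
       MAX(LineTotal) AS MaxLine        -- Maximum
FROM Sales.SalesOrderDetail
GROUP BY ProductID;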
Audit: The Audit transformation exposes system variables to the Data Flow that
can be used in the stream. This is accomplished by adding columns to the Data Flow
output. When you map the required system variable or variables to the output columns,
the system variables are introduced into the flow and can be used.

Character Map: It performs string manipulations on input columns, like lowercase, uppercase, etc.

Conditional Split: The Conditional Split task splits Data Flow based on a
condition. Depending upon the results of an evaluated expression, data is routed as
specified by the developer.
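Conceptually, a Conditional Split behaves like routing rows with mutually exclusive WHERE clauses; each output receives the rows that satisfy its expression. The table and column below are hypothetical, used only to illustrate the idea.

-- Rows sent to a "HighValue" output
SELECT * FROM dbo.Orders WHERE OrderTotal >= 1000;

-- Rows sent to the default output (everything that matched no earlier condition)
SELECT * FROM dbo.Orders WHERE OrderTotal < 1000;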

Copy Column: The Copy Column task makes a copy of a column contained in
the input-columns collection and appends it to the output-columns collection.

Data Conversion: It converts data from one type to another, just like type casting.

Data Mining Query: The data-mining implementation in SQL Server 2005 is all
about the discovery of factually correct forecasted trends in data. This is configured
within SSAS against one of the provided data-mining algorithms. The DMX query
requests a predictive set of results from one or more such models built on the same
mining structure. It can be a requirement to retrieve predictive information about the
same data calculated using the different available algorithms.

Derived Column: One or more new columns are appended to the output-columns collection based upon the work performed by the task, or the result of the derived function replaces an existing column value.

Export Column: It is used to extract data from within the input stream and write it to a file. There's one caveat: the data type of the column or columns for export must be DT_TEXT, DT_NTEXT, or DT_IMAGE.

Fuzzy Grouping: Fuzzy Grouping is for use in cleansing data. By setting and
tweaking task properties, you can achieve great results because the task interprets
input data and makes intelligent decisions about its uniqueness.

Fuzzy Lookup: It uses a reference (or lookup) table to find suitable matches. The
reference table needs to be available and selectable as a SQL Server 2005 table. It uses
a configurable fuzzy-matching algorithm to make intelligent matches.

Import Column: It is used to import data from any file or source.

Lookup: The Lookup task leverages reference data and joins between input
columns and columns in the reference data to provide a row-by-row lookup of source
values. This reference data can be a table, view, or dataset.

Merge: The Merge task combines two separate sorted datasets into a single
dataset that is expressed as a single output.

Merge Join: The Merge Join transform uses joins to generate output. Rather than
requiring you to enter a query containing the join, however (for example SELECT
x.columna, y.columnb FROM tablea x INNER JOIN tableb y ON x.joincolumna =
y.joincolumnb), the task editor lets you set it up graphically.

Multicast: The Multicast transform takes an input and makes any number of
copies directed as distinct outputs. Any number of copies can be made of the input.

OLE DB Command: The OLE DB Command transform executes a SQL statement for each row in the input stream. It's kind of like a high-performance cursor in many ways.

Percentage Sampling: The Percentage Sampling transform generates and


outputs a dataset into the Data Flow based on a sample of data. The sample is entirely
random to represent a valid cross-section of available data.

Pivot: The Pivot transformation essentially encapsulates the functionality of a pivot query in SQL. A pivot query denormalizes a normalized data set by rotating the data around a central point (a value).

Row Count: The Row Count task counts the number of rows as they flow through
the component. It uses a specified variable to store the final count. It is a very
lightweight component in that no processing is involved, because the count is just a
property of the input-rows collection.

Row Sampling: The Row Sampling task, in a similar manner to the Percentage Sampling transform discussed earlier, is used to create a (pseudo) random selection of data from the Data Flow. This transform is very useful for performing operations that would normally be executed against a full set of data held in a table. In very high-volume OLTP databases, however, this just isn't possible at times. The ability to execute tasks against a representative subset of the data is a suitable and valuable alternative.

Sort: This transform is a step further than the equivalent ORDER BY clause in
the average SQL statement in that it can also strip out duplicate values.
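In T-SQL terms, the Sort transformation with its duplicate-removal option enabled is roughly equivalent to the following query (the table is hypothetical); this is only an analogy, since the SSIS option removes rows with duplicate sort-key values rather than fully duplicate rows.

SELECT DISTINCT CustomerID, CustomerName   -- duplicates stripped out
FROM dbo.Customer
ORDER BY CustomerName;                     -- the sort itself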

Script Component: The Script Component is used for scripting custom code in a transformation. It can be used not only as a transform but also as a source or a destination component.

Slowly Changing Dimension: The Slowly Changing Dimension task is used to


maintain dimension tables held in data warehouses. It is a highly specific task that acts
as the conduit between an OLTP database and a related OLAP database.

Term Extraction: This transformation extracts terms from within an input


column and then passes them into the Data Flow as an output column. The source
column data type must be either DT_STR or DT_WSTR.

Term Lookup: This task wraps the functionality of the Term Extraction transform
and uses the values extracted to compare to a reference table, just like the Lookup
transform.

Union All: Just like a Union All statement in SQL, the Union All task combines
any number of inputs into one output. Unlike in the Merge task, no sorting takes place in
this transformation. The columns and data types for the output are created when the
first input is connected to the task.

Unpivot: This task essentially encapsulates the functionality of an unpivot query in SQL. An unpivot query increases the normalization of a less-normalized or denormalized data set by rotating the data back around a central point (a value).
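The T-SQL UNPIVOT operator does the same job. Using the Product/Color/Price example from the Pivot Transform section later in this material (with an assumed pivoted table name), the query below rotates the three color columns back into rows:

SELECT Product, Color, Price
FROM (SELECT Product, [White], [Pink], [Orange]
      FROM dbo.ProductPricePivot) AS p                      -- assumed pivoted source table
UNPIVOT (Price FOR Color IN ([White], [Pink], [Orange])) AS u;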
Data Flow Destination
Finally, the processed data is saved to the destination with the help of these items.

Data Mining Model Training: It trains data-mining models using sorted data contained in the upstream Data Flow. The received data is piped through the SSAS data-mining algorithms for the relevant model.
DataReader Destination: The results of an SSIS package executed from a .NET
assembly can be consumed by connecting to the DataReader destination.

Dimension Processing: Dimension Processing is another SSAS-related


destination component. It is used to load and process an SSAS dimension.

Excel Destination: The Excel Destination has a number of options for how the
destination Excel file should be accessed. (Table or View, TableName or ViewName
variable, and SQL Command)

Flat File Destination: The Flat File Destination component writes data out to a
text file in one of the standard flat-file formats: delimited, fixed width, fixed width with
row delimiter.

OLE DB Destination: The OLE DB Destination component inserts data into any OLE DB-compliant data source.

Partition Processing: The Partition Processing destination type loads and


processes an SSAS partition. In many ways, it is almost exactly the same as the
Dimension Processing destination, at least in terms of configuration. You select or
create an SSAS connection manager, choose the partition to process, and then map
input columns to the columns in the selected partition.

Raw File Destination: The Raw File Destination is all about raw speed. It is an
entirely native format and can be exported and imported more rapidly than any other
connection type, in part because the data doesn't need to pass through a connection
manager.

Recordset Destination: The Recordset Destination creates an instance of an


ActiveX Data Objects (ADO) Recordset and populates it with data from specified input
columns.

SQL Server Destination: The SQL Server Destination provides a connection to


a SQL Server database. Selected columns from the input data are bulk inserted into a
specified table or view. In other words, this destination is used to populate a table held
in a SQL Server database.

SQL Server Mobile Destination: The SQL Server Mobile Destination


component is used to connect and write data to a SQL Server Mobile (or SQL Server
Compact Edition) database

How to create an SSIS Project

1) Open BIDS (Business Intelligence Development Studio)
You should have SQL Server (2005 or higher) installed on your machine with BIDS.
Go to Start // Programs // Microsoft SQL Server (with the version you have installed) and open SQL Server Business Intelligence Development Studio.

Below is an example using Windows 7 and SQL Server 2008 R2.

2) Create a new project

In BIDS select File // New // Project

You will get the new project dialog box where you should:
Select Business Intelligence Projects in Project Types
Select Integration Services Project in Templates
Give it a name (try to avoid spaces for compatibility reasons)
Remember or change the location
Click OK to create the SSIS Project

How to create an SSIS Package

First create an SSIS Project using BIDS (for more information visit Create SSIS Project).
Below is an example of an empty package. I have highlighted the elements we will use and briefly discuss them below (you can ignore the rest):

Solution Explorer - On the right you see Solution Explorer with your SSIS project (first icon from the top). If you don't have it, go to View // Solution Explorer. In the majority of cases you will use SSIS Packages only. The rest is not used in practice (best practice).
Package tab - In the middle we have our package.dtsx opened, which contains the control flow and data flow that we will use.
Toolbox - This shows the tools (items/tasks) that we can use to build our ETL package. The Toolbox is different for the control flow and data flow tabs in the package.
Control Flow - Here you will be able to control your execution steps. For example, you can log certain information before you start the data transfer, you can check if a file exists, and you can send an email when a package fails or finishes. Here you will also add a task to move data from source to destination; however, you will use the data flow tab to configure it.
Data Flow - This is used to extract source data and define the destination. During the "data flow" you can perform all sorts of transformations, for instance create new calculated fields, perform aggregations and many more.


Let's get to work.
Make sure you are in the Control Flow tab in the SSIS Package designer, find Data Flow Task in the Toolbox and drag it into the empty space in the control flow pane.
Right click the data flow task that you dragged and rename it to Employee Load (and hit Enter to get out of edit mode).

Double click the Employee Load data flow (ensure the box is not selected; otherwise the double click will work like rename). Notice that SSIS automatically goes to the data flow task where you can configure the data flow.
See the screenshot below, which shows that we are now in the Data Flow tab, and notice the data flow task drop-down box which says Employee Load. You can have multiple data flow items in the control flow, so this drop-down box allows you to change between them.
From the Toolbox (while in the data flow tab) drag Flat File Source into the empty space.
Right click the source and select rename. Type Employee CSV Source.

Double click the Employee CSV Source. A dialog box will appear with the header 'Flat File Source Editor'.
Next we will create an SSIS Package connection which will be stored in the package and will connect to the CSV file. In order to do that, click the New button.
Type the connection manager name and description.
Click the Browse button and find the employee.csv file (by default you will only see *.txt files; change the filter to *.csv files).
Once you are back, tick "Column names in the first data row".
You should see a warning that states that your columns are not defined. Simply click Columns, which will set them for you (the default settings should be fine).
The OK button should now be enabled, so click it to complete the process.
On the first dialog box the connection manager should say EmployeeCSV; click OK to close the dialog box.

Now from the Toolbox let's drag OLE DB Destination into the data flow empty space and rename it to Employee Table (OLE DB Destination in the Toolbox is in the Data Flow Destinations group. I thought I would clarify that, as it is easy to pick OLE DB Source, which is not what we want.)

Now we are going to create a data path, which means that we define the source and its destination. We will do that by clicking the source (once). You should see a green arrow. Click it (once, or press and hold) and move it over the destination (click or release the mouse). You have created a "data path" in the SSIS Package (Data Flow).

Double click the Employee Table destination.

Create a new connection by clicking the New button, and the New button again on the dialog box that pops up.
Put in the server name. If you are connecting to a local server, type localhost.
Select the database from the drop-down box and click OK on all dialog boxes to confirm your choices.

Now that the new connection is selected, we will create the destination table. Notice that I highlighted data access mode with the value "table or view - fast load"; this is an important setting that makes the load very quick, so make sure you remember this one.
To create a new table click New for the table/view drop-down box (see below), change the table name to [Employee] and click OK.
To finish the process click Mappings, which will create the mapping between source fields and destination fields, and click OK.

Let's test our SSIS Package. Click run (the play button on the toolbar). You should see that the extract from the source worked (green), the arrows should show 2 rows from our CSV file, and the destination should also go green, which means it successfully loaded 2 rows from the file.

Derived Column Transformation:

Steps: Follow steps 1 to 3 in my first article to open BIDS and select the Integration Services project type. Once the project is created, we will see how to use the Derived Column control. Once you open the project, just drag and drop the Derived Column control and a source and destination provider as shown in the image below. Now we need to do the configuration for each of the tasks; first we will start with the source. In our example we are going to create a table as shown in the script below:

CREATE TABLE EmpDetails (EMPID int, EMPFName varchar(10), EMPLName varchar(10), EMPDOB datetime, EMPSal int, EMPHra int)
GO
INSERT INTO EmpDetails (EMPID, EMPFName, EMPLName, EMPDOB, EMPSal, EMPHra)
VALUES (1, 'Karthik', 'Anbu', '01/01/1980', 10000, 1500),
       (2, 'Arun', 'Kumar', '02/02/1981', 8000, 1200),
       (3, 'Ram', 'Kumar', '01/02/1982', 6000, 1000)

Now configure the source to get the details from the table above. Once the source is configured, we need to do the configuration for the destination section. Here we are going to create a new table as shown in the script below:

CREATE TABLE EmpDetailsDestination (EmpFullName varchar(21), EmpAge int, EmpCTC int, InsertedDate datetime)

Now the records in both the source and destination tables are shown in the screen below. Our primary goal is to do some manipulations using the Derived Column task and save the result in a separate table. So we configure the Derived Column control; double-clicking it will open the configuration window shown in the screen below. In the expression section you can see that we have created some expressions to do the manipulations as per our requirement. Now we need to do the configuration for the destination by mapping the columns as shown in the screen below. Once all the task steps are configured, press F5 to build and execute the package. Once your package is executed, your screen looks like the one below, and we can see the output in the destination table as expected.
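The exact expressions used in the screenshots are not reproduced here, but given the two table definitions above, a reasonable guess at the equivalent logic in plain T-SQL is the following (full name from first and last name, age from the date of birth, CTC as salary plus HRA, and the load timestamp):

INSERT INTO dbo.EmpDetailsDestination (EmpFullName, EmpAge, EmpCTC, InsertedDate)
SELECT EMPFName + ' ' + EMPLName,            -- EmpFullName
       DATEDIFF(YEAR, EMPDOB, GETDATE()),    -- EmpAge (approximate, ignores birthdays not yet reached)
       EMPSal + EMPHra,                      -- EmpCTC
       GETDATE()                             -- InsertedDate
FROM dbo.EmpDetails;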

Merge Join

Merge multiple data sources with SQL Server Integration Services.

Problem
When loading data into SQL Server, you have the option of using SQL Server Integration Services to handle more complex loading and data transforms than just doing a straight load, such as using BCP. One problem that you may be faced with is that data is given to you in multiple files, such as sales and sales orders, but the loading process requires you to join these flat files during the load instead of doing a preload and then later merging the data. What options exist and how can this be done?
Solution
SQL Server Integration Services (SSIS) offers a lot more features and options than DTS offered. One of these new options is the MERGE JOIN task. With this task you can merge multiple input files into one process and handle this source data as if it was from one source.
Let's take a look at an example of how to use this.
Here we have two source files, an OrderHeader and an OrderDetail. We want to merge this data and load it into one table in SQL Server called Orders.
OrderHeader source file.

OrderDetail source file

Orders table

Building the SSIS Package


First create a new SSIS package and create the three Connections that we will need.
1. Flat File Source 1 - OrderHeader
2. Flat File Source 2 - OrderDetail
3. OLE DB Destination - SQLServer

Then add a DATA FLOW task.

Next we need to build our load from these two flat file sources and then use the MERGE
JOIN task to merge the data. So the Data Flow steps would look something like this.

At this point if you try to edit the MERGE JOIN task you will get the below error. The reason
for this is because the data needs to be sorted for the MERGE JOIN task to work. We will
look at two options for handling this sorting need.

Option #1 - Data is presorted prior to loading the data.

Let's assume that our data is sorted prior to loading. We therefore need to tell SSIS this is the case, as well as show which column the data is sorted on. First, right click on "Flat File Source" and select "Show Advanced Editor". On the Input and Output Properties tab you need to change "IsSorted" to True for both of the Flat File Sources.

Next you need to let SSIS know which column is the SortKey. Here we are specifying the
OrderID column. This also needs to be done for both of the flat file sources.

Once this is complete you will be able to move on with the setup and select the input
process as shown below.

From here you can select the columns that you want to have for output as well as determine
what type of join you want to employ between these two files.

Lastly you would need to add your OLE Destination, select the table and map the columns to
finish the process.

Option #2 - Source data is not sorted


With this load process, let's assume the source data is not sorted first, so we need to use
the SORT task to sort the data prior to using the MERGE JOIN task. The following shows our
Flat File sources and then a SORT task after each one of these and then lastly our MERGE
JOIN task.

If you right click the Sort task and select Edit you will get a screen such as following. Here
you need to select which column the data should be sorted on. This needs to be done for
both of the flat source files.

After this is done you can move on and finish the load process. The MERGE JOIN works just
like it was stated above as well as the OLE DB Destination.
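For comparison, if both files had first been bulk loaded into staging tables, the MERGE JOIN step would correspond to an ordinary T-SQL join such as the one below. Only OrderID is named in the walkthrough as the sort key, so the remaining column names are assumptions.

SELECT h.OrderID,
       h.OrderDate,           -- assumed header columns
       d.ProductID,
       d.Quantity,
       d.UnitPrice            -- assumed detail columns
FROM dbo.OrderHeader AS h
INNER JOIN dbo.OrderDetail AS d
        ON h.OrderID = d.OrderID     -- the sort/join key used in the SSIS package
ORDER BY h.OrderID;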

Lookup Transformation
The Lookup transformation performs lookups by joining data in input columns with columns in a
reference dataset. We use the lookup to access additional information in a related table that is based
on values in common join columns. Lookup transformation dataset can be a cache file, an existing
table or view, a new table, or the result of an SQL query.

Implementation
In this scenario we want to get the department name and location information from the department
table for each corresponding employee record from the source employee table.

Here we have the EMP table as OLEDB Source, next the DEPT table as the Lookup dataset and finally
the OLEDB Destination table to stage the data.

Next we double-click the Lookup transformation to go to the Editor. Select the Connection type to
OLEDB connection manager. When required the Lookup dataset can be a Cache file.

Cache Mode
There are three types of caching options available to be configured- Full cache, Partial cache and No
cache. In case of Full cache, the Lookup transformation generates a warning while caching, when the
transformation detects duplicates in the join key of the reference dataset.

Next we select the OLEDB connection object from the OLEDB connection manager browser. Next we
specify the table or view. We can also use the resultant dataset of an SQL statement as Lookup
reference as mentioned earlier if required.

Next we define the simple equi join condition between the Source Input Columns and the Reference
Lookup Available columns. Next we define the Lookup Columns as Output. We can rename or Alias the
Reference Lookup column name if required.

Next in case of Partial Cache mode we can specify the Cache size here. Also we can modify the
Custom query if required.

Select Ignore failure for Error. If there is no matching entry in the reference dataset, no join occurs.
By default, the Lookup transformation treats rows without matching entries as errors. However, if we
configure the Lookup transformation to Ignore lookup failure then such rows are redirected to no match
output.

Lookup Output
The Lookup transformation has the following outputs:

Match output- It handles the rows in the transformation input that matches at least one entry
in the reference dataset.

No Match output- It handles rows in the input that do not match any entry in the reference
dataset.

As mentioned earlier, if Lookup transformation is configured to treat the rows without matching
entries as errors, the rows are redirected to the error output else they are redirected to the no
match output.

Error output- It handles the error records.
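In T-SQL terms the Lookup behaves like a LEFT JOIN from the input to the reference table. For the EMP/DEPT scenario above (column names assumed to follow the classic EMP/DEPT layout), the match and no-match outputs correspond to the two cases below:

SELECT e.*, d.DNAME, d.LOC
FROM dbo.EMP AS e
LEFT JOIN dbo.DEPT AS d
       ON e.DEPTNO = d.DEPTNO;   -- the equi-join condition defined in the Lookup editor

-- Rows where d.DEPTNO IS NULL are the ones the Lookup would send to the "no match" output
-- (or to the error output if the transformation keeps its default failure behaviour).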

Let's go to the Lookup transformation Advanced Editor.

Below Lookup Transform Advanced Editor- Component Properties

Below Lookup Transform Advanced Editor- Input Columns

Below Lookup Transform Advanced Editor- Input & Output Properties.

Fuzzy Lookup

Select "Fuzzy Lookup" from "Data Flow Transformations" and drag it onto the "Data Flow" tab. Connect the extended green arrow from the OLE DB Source to your Fuzzy Lookup. Double click the Fuzzy Lookup task to configure it.

Select the "OLE DB Connection" and "Reference Table name" on the "Reference Table" tab.

Map the Lookup column and Output column on the "Columns" tab. Add the prefix "Ref_" in the output column field.

Leave all values as they are on the "Advanced" tab.

Select "Conditional Split" from "Data Flow Transformations" and drag it onto the "Data Flow" tab, and connect the extended green arrow from the Fuzzy Lookup to your "Conditional Split". Double click the Conditional Split task to configure it.

Create two outputs. One is "Solid Matched", whose condition is "_Similarity > 0.85 && _Confidence > 0.8", and the other is "Likely Matched", whose condition is "_Similarity > .65 && _Confidence > 0.75". Click OK.

Select "Derived Column" from "Data Flow Transformations" and drag it onto the "Data Flow" tab, and connect the extended green arrow from the Conditional Split to your "Derived Column".

Select the output as "Solid Matched" and click OK.

Double click the Derived Column task to configure it.

Select another "Derived Column" from "Data Flow Transformations" and drag it onto the "Data Flow" tab, and connect the extended green arrow from the Conditional Split to your "Derived Column 1".

Select the output as "Likely Matched" and click OK.

Double click the Derived Column 1 task to configure it.

Select another "Derived Column" from "Data Flow Transformations" and drag it onto the "Data Flow" tab, and connect the extended green arrow from the Conditional Split to your "Derived Column 2".

Double click the Derived Column 2 task to configure it.

Select "Union All" from "Data Flow Transformations" and drag it onto the "Data Flow" tab, and connect the extended green arrows from Derived Column, Derived Column 1 and Derived Column 2 to your "Union All".

Double click the Union All task to configure it.

Select "SQL Server Destination" from "Data Flow Destinations" and drag it onto the "Data Flow" tab, and connect the extended green arrow from Union All to your "SQL Server Destination".

Double click the SQL Server Destination task to configure it. Click New to create a new table or select one from the list.

Click OK.

If you execute the package with debugging (press F5), the package should succeed and appear as shown here:

SELECT [firstName]
,[LastName]
,[Ref_firstName]
,[Ref_LastName]
,[_Similarity]
,[_Confidence]
,[_Similarity_firstName]
,[_Similarity_LastName]
,[_Match]

FROM [Test].[dbo].[SQL Server Destination]


Pivot Transform
It is often required to convert rows to columns to visualize data in a different way. The Pivot transform in SSIS helps to perform this task.

Example:
The data looks like:

Product    Color     Price
iPhone     White     199
iPad       White     300
iPhone     Pink      250
iPod       White     50
iPad       Pink      350
iPod       Pink      75
iPhone     orange    150
iPad       orange    399
iPod       orange    50

Using Pivot on Color with Price as the value will result in:

Product    orange    Pink    White
iPad       399       350     300
iPhone     150       250     199
iPod       50        75      50

In other words, a normalized table with redundancy can be converted to a denormalized table using Pivot. You can use the PIVOT operator in T-SQL to perform the above task, as well as the Pivot transformation in SSIS. The Pivot transformation is a little bit tricky.
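For the small Product/Color/Price data set above, the T-SQL PIVOT equivalent looks like this (the source table name is assumed for the sketch):

SELECT Product, [White], [Pink], [orange]
FROM (SELECT Product, Color, Price
      FROM dbo.ProductSales) AS src                          -- assumed source table
PIVOT (SUM(Price) FOR Color IN ([White], [Pink], [orange])) AS pvt;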
1. Source:
Query:

use AdventureWorks
select
    YEAR(OrderDate) as [Year],
    pc.Name as ProductCategoryName,
    SUM(linetotal) as LineTotal
from
    Production.Product p
    join Production.ProductSubcategory ps
        on p.ProductSubcategoryID = ps.ProductSubcategoryID
    join Production.ProductCategory pc
        on pc.ProductCategoryID = ps.ProductCategoryID
    join Sales.SalesOrderDetail sod
        on sod.ProductID = p.ProductID
    join Sales.SalesOrderHeader soh
        on soh.SalesOrderID = sod.SalesOrderID
group by
    YEAR(OrderDate),
    pc.Name
order by
    YEAR(OrderDate)    -- the pivot key column must be sorted (see the note in the steps below)

Produces the following result:

Year    ProductCategoryName    LineTotal
2001    Accessories            20235.36461
2001    Bikes                  10661722.28
2001    Clothing               34376.33525
2001    Components             615474.9788
2002    Accessories            92735.35171
2002    Bikes                  26486358.2
2002    Clothing               485587.1528
2002    Components             3610092.472
2003    Accessories            590257.5852
2003    Bikes                  34923280.24
2003    Clothing               1011984.504
2003    Components             5485514.832
2004    Accessories            568844.5824
2004    Bikes                  22579811.98
2004    Clothing               588594.5323
2004    Components             2091511.004

Destination:
Pivoting ProductCategoryName using the LineTotal value will result in:

Year    Accessories    Bikes          Clothing     Components
2001    20235.36461    10661722.28    34376.34     615474.9788
2002    92735.35171    26486358.2     485587.2     3610092.472
2003    590257.5852    34923280.24    1011985      5485514.832
2004    568844.5824    22579811.98    588594.5     2091511.004

Create the table in the destination:

USE tempdb
GO

CREATE TABLE [dbo].[Pivot_Example](
    [Year] [int] NULL,
    [Accessories] [float] NULL,
    [Bikes] [float] NULL,
    [Clothing] [float] NULL,
    [Components] [float] NULL
) ON [PRIMARY]
GO

Steps to use the Pivot Transform:

1) Configure the OLE DB Source and use the above query as the source in the data flow task.
2) Drag in and open the Pivot Transform and go to Input Columns. Select all inputs, as we are going to use all of them in the Pivot.
3) Go to Input and Output Properties and expand Pivot Default Input. Here we will configure how the inputs will be used in the Pivot operation using the Pivot Key Value (PivotUsage).

Pivot Key Value    Function                                                    Our example columns
0                  The column is passed through unaffected                     -
1                  The column values become the rows of the pivot              Year
2                  The column values become the column names of the pivot      ProductCategoryName
3                  The column values that are pivoted in the pivot             LineTotal

Set the Year input column's PivotUsage to 1. Similarly, do it for ProductCategoryName with PivotUsage = 2 and LineTotal with PivotUsage = 3.

Note: Input columns which are used with PivotUsage = 1 should be sorted before the Pivot Transform. (See the order by at the end of the source query.)

4) Expand Pivot Default Output, click on the Output Columns and click Add Column. Please note that our destination has five columns; all columns need to be manually created in this section.

Note:

Name - The name for the output column.

PivotKeyValue - The value in the pivoted column that will go into this output.

Source Column - The lineage ID of the input column which holds the value for the output column.

In our above example:

Output Column    Lineage ID
Year             Lineage ID of the Year input column
Accessories      Lineage ID of the LineTotal input column
Bikes            Lineage ID of the LineTotal input column
Clothing         Lineage ID of the LineTotal input column
Components       Lineage ID of the LineTotal input column

5) Bring in the OLE DB destination and map the columns.

The data flow has its own workspace in SSIS, which is represented by the Data Flow tab in SSIS Designer, as shown in Figure 1.

Figure 1: The Data Flow tab in SSIS Designer


Before we can do anything on the Data Flow tab, we must first add a Data Flow task to our control flow. To
add the task, drag it from the Control Flow Items window to the Control Flow tab of the SSIS Designer
screen, as illustrated in Figure 2.

Figure 2: Adding a Data Flow task to the control flow

To configure the data flow, double-click the Data Flow task in the control flow. This will move you to the Data
Flow tab, shown in Figure 3.

Figure 3: The Data Flow tab in SSIS Designer

Configuring the Data Flow


You configure a Data Flow task by adding components to the Data Flow tab. SSIS supports three types of
data flow components:

Sources: Where the data comes from


Transformations: How you can modify the data

Destinations: Where you want to put the data

A Data Flow task will always start with a source and will usually end with a destination, but not always. You
can also add as many transformations as necessary to prepare the data for the destination. For example, you
can use the Derived Column transformation to add a computed column to the data flow, or you can use a
Conditional Split transformation to split data into different destinations based on specified criteria. This
and other components will be explained in future articles.
To add components to the Data Flow task, you need to open the Toolbox if it's not already open. To do this, point to the View menu and then click Toolbox, as shown in Figure 4.

Figure 4: Opening the Toolbox to view the data flow components


At the left side of the Data Flow tab, you should now find the Toolbox window, which lists the various
components you can add to your data flow. The Toolbox organizes the components according to their
function, as shown in Figure 5.

Figure 5: The component categories as they appear in the Toolbox


To view the actual components, you must expand the categories. For example, to view the source components,
you must expand the Data Flow Sources category, as shown in Figure 6

Figure 6: Viewing the data flow source components

Adding an OLE DB Source


The first component we're going to add to the data flow is a source. Because we're going to be retrieving data from a SQL Server database, we'll use an OLE DB source. To add the component, expand the Data Flow Sources category in the Toolbox. Then drag an OLE DB source from there to the Data Flow window. Your data flow should now look similar to Figure 7.

Figure 7: Adding an OLE DB source to your data flow


You will see that we have a new item named OLE DB Source. You can rename the component by right-clicking it and selecting Rename. For this example, I renamed it Employees.
There are several other features about the OLE DB source worth noting:

A database icon is associated with that source type. Other source types will show different icons.
A reversed red X appears to the right of the name. This indicates that the component has not yet been properly configured.

Two arrows extend below the component. These are called data paths. In this case, there is one green and one red. The green data path marks the flow of data that has no errors. The red data path redirects rows whose values are truncated or that generate an error. Together these data paths enable the developer to specifically control the flow of data, even if errors are present.
To configure the OLE DB source, right-click the component and then click Edit. The OLE DB Source Editor
appears, as shown in Figure 8.

Figure 8: Configuring the OLEDB source


From the OLE DB connection manager drop-down list, select the OLE DB connection manager we set up
in the last article, the one that connects to the AdventureWorks database.
Next, you must select one of the following four options from the Data access mode drop-down list:

Table or view
Table name or view name variable

SQL command

SQL command from variable

For this example, we'll select the Table or View option because we'll be retrieving our data through the uvw_GetEmployeePayRate view, which returns the latest employee pay raise and the amount of that raise. Listing 1 shows the Transact-SQL used to create the view in the AdventureWorks database.

CREATE VIEW uvw_GetEmployeePayRate
AS
SELECT H.EmployeeID,
       RateChangeDate,
       Rate
FROM HumanResources.EmployeePayHistory H
JOIN (SELECT EmployeeID,
             MAX(RateChangeDate) AS [MaxDate]
      FROM HumanResources.EmployeePayHistory
      GROUP BY EmployeeID
     ) xx ON H.EmployeeID = xx.EmployeeID
         AND H.RateChangeDate = xx.MaxDate
GO
Listing 1: The uvw_GetEmployeePayRate view definition
After you ensure that Table or view is selected in the Data access mode drop-down list, select the
uvw_GetEmployeePayRate view from the Name of the table or the view drop-down list. Now go to
the Columns page to select the columns that will be returned from the data source. By default, all columns are
selected. Figure 9 shows the columns (EmployeeID, RateChangeDate, and Rate) that will be added to the
data flow for our package, as they appear on the Columns page.

Figure 9: The Columns page of the OLE DB Source Editor


If there are columns you don't wish to use, you can simply uncheck them in the Available External Columns box.
Now click on the Error Output page (shown in Figure 10) to view the actions that the SSIS package will take
if it encounters errors.

Figure 10: The Error Output page of the OLE DB Source Editor
By default, if there is an error or truncation, the component will fail. You can override the default behavior, but explaining how to do that is beyond the scope of this article. You'll learn about error handling in future articles.
Now return to the Connection Manager page and click the Preview button to view a sample dataset in the
Preview Query Results window, shown in Figure 11. Previewing the data ensures that what is being
returned is what you are expecting.

Figure 11: Previewing a sample dataset


After you've configured the OLE DB Source component, click OK.

Adding a Derived Column Transformation

The next step in configuring our data flow is to add a transformation component. In this case, we'll add the Derived Column transformation to create a column that calculates the annual pay increase for each employee record we retrieve through the OLE DB source.
To add the component, expand the Data Flow Transformations category in the Toolbox window, and
drag the Derived Column transformation (shown in Figure 12) to the Data Flow tab design surface.

Figure 12: The Derived Column transformation as it's listed in the Toolbox
Drag the green data path from the OLE DB source to the Derived Column transformation to associate the two components, as shown in Figure 13. (If you don't connect the two components, they won't be linked and, as a result, you won't be able to edit the transformation.)

Figure 13: Using the data path to connect the two components
The next step is to configure the Derived Column component. Double-click the component to open the
Derived Column Transformation Editor, as shown in Figure 14.

Figure 14: Configuring the Derived Column transformation


This editor is made up of three regions, which I've labeled 1, 2 and 3:

1. Objects you can use as a starting point. For example, you can either select columns from your data flow or select a variable. (We will be working with variables in a future article.)
2. Functions and operators you can use in your derived column expression. For example, you can use a mathematical function to calculate data returned from a column or use a date/time function to extract the year from a selected date.
3. Workspace where you build one or more derived columns. Each row in the grid contains the details necessary to define a derived column.

For this exercise, we'll be creating a derived column that calculates a pay raise for employees. The first step is
to select the existing column that will be the basis for our new column.
To select the column, expand the Columns node, and drag the Rate column to the Expression column of
the first row in the derived columns grid, as shown in Figure 15.

Figure 15: Adding a column to the Expression column of the derived column grid
When you add your column to the Expression column, SSIS prepopulates the other columns in that row of
the grid, as shown in Figure 16.

Figure 16: Prepopulated values in derived column grid


As you can see, SSIS has assigned our derived column the name Derived Column 1 and set the Derived Column value to <add as new column>. In addition, our [Rate] field now appears in the Expression column, and the currency [DT_CY] value has been assigned to the Data Type column.
You can change the Derived Column Name value by simply typing a new name in the box. For this example, I've renamed the column NewPayRate.
For the Derived Column value, you can choose to add a new column to your data flow (which is the default value, <add as new column>) or to replace one of the existing columns in your data flow. In this instance, we'll add a new column, but there may be times when overwriting a column is required.
The data type is automatically created by the system and can't be changed at this stage.
Our next step is to refine our expression. Currently, because only the Rate column is included in the
expression, the derived column will return the existing values in that column. However, we want to calculate a
new pay rate. The first step, then, is to add an operator. To view the list of available operators, expand the list
and scroll through them. Some of the operators are for string functions and some for math functions.
To increase the employee's pay rate by 5%, we'll use the following calculation:
[Rate] * 1.05
To do this in the Expression box, either type the multiplication operator (*), or drag it from the list of operators
to our expression (just after the column name), and then type 1.05, as shown in Figure 17.

Figure 17: Defining an expression for our derived column


You will see that the Data Type has now changed to numeric [DT_NUMERIC].
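If you want to check the result outside the package, the same calculation can be run directly against the view from Listing 1; the derived column simply adds the equivalent of the NewPayRate expression to each row.

SELECT EmployeeID,
       RateChangeDate,
       Rate,
       Rate * 1.05 AS NewPayRate    -- same expression as the derived column
FROM dbo.uvw_GetEmployeePayRate;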

Once you are happy with the expression, click on OK to complete the process. You will be returned to the Data
Flow tab. From here, you can rename the Derived Column transformation to clearly show what it does.
Again, there are two data paths to use to link to further transformations or to connect to destinations.

Adding an Excel Destination


Now we need to add a destination to our data flow to enable us to export our results into an Excel
spreadsheet.
To add the destination, expand the Data Flow Destinations category in the Toolbox, and drag the Excel
destination to the SSIS Designer workspace, as shown in Figure 18.

Figure 18: Adding an Excel destination to your data flow


Now connect the green data path from the Derived Column transformation to the Excel destination to
associate the two components, as shown in Figure 19.

Figure 19: Connecting the data path from the transformation to the destination
As you can see, even though we have connected the PayRate transformation to the Excel destination, we
still have the reversed red X showing us that there is a connection issue. This is because we have not yet
selected the connection manager or linked the data flow columns to those in the Excel destination.
Next, right-click the Excel destination, and click Edit. This launches the Excel Destination Editor
dialog box, shown in Figure 20. On the Connection Manager page, under OLE DB connection
manager, click the New button. Then, under Excel File Path, click the Browse button, select the file
you created in the previous article, and click OK. Finally, under Name of the Excel Sheet, select the
appropriate sheet from the file.

Figure 20: Configuring the Excel destination component


At the bottom of the Connection Manager page, you'll notice a message that indicates we haven't mapped
the source columns with the destination columns. To do this, go to the Mappings page (shown in Figure 21)
and ensure that the columns in the data flow (the input columns) map correctly to the columns in the
destination Excel file. The package will make a best guess based on field names; however, for this example,
I have purposefully named my columns in the Excel spreadsheet differently from those in the source database
so they wouldn't be matched automatically.

Figure 21: The Mappings page of the Excel Destination Editor


To match the remaining columns, click the column name in the Input Column grid at the bottom of the page,
and select the correct column. As you select the column, the list will be reduced so that only those columns not
linked are available. At the same time, the source and destination columns in the top diagram will be connected
by arrows, as shown in Figure 22.

Figure 22: Mapping the columns between the data flow and the destination
Once you've properly mapped the columns, click OK. The Data Flow tab should now look similar to the
screenshot in Figure 23.

Figure 23: The configured data flow in your SSIS package

Running an SSIS Package in BIDS


Now all we need to do is execute the package and see if it works. To do this, click the Execute button. It's the
green arrow on the toolbar, as shown in Figure 24.

Figure 24: Clicking the Execute button to run your SSIS package
As the package progresses through the data flow components, each one will change color. The component will
turn yellow while it is running, then turn green or red on completion. If it turns green, it has run successfully, and
if it turns red, it has failed. Note, however, that if a component runs too quickly, you won't see it turn yellow.
Instead, it will go straight from white to green or red.
The Data Flow tab also shows the number of rows that are processed along each step of the way. That
number is displayed next to the data path. For our example package, 290 rows were processed between the
Employees source and the PayRate transformation, and 290 rows were processed between the
transformation and the Excel destination. Figure 25 shows the data flow after the three components ran
successfully. Note that the number of processed rows is also displayed.

Figure 25: The data flow after it has completed running


You can also find details about the package's execution on the Progress tab (shown in Figure 26). The tab
displays each step of the execution process. If there is an error, a red exclamation mark is displayed next to the
step's description. If there is a warning, a yellow exclamation mark is displayed. We will go into resolving errors
and how to find them in a future article.

Figure 26: The Progress tab in SSIS Designer


Now all that's needed is to check the Excel file to ensure that the data was properly added. You should
expect to see results similar to those in Figure 27.

Figure 27: Reviewing the Excel file after package execution


SCD (Slowly Changing Dimension)

This transformation is used to implement Type 1 and Type 2 SCDs; for other types we need to add some
custom logic to our ETL. (Type 1: identify the changes and update the record in place, so no history is
recorded. Type 2: when a change is identified, we expire the old record and create a new record with the
new values, so the history is preserved in the old record.)
OK, let's take a simple Employee dimension example. In this example I am getting an EmployeeFeed.xls
file as input for my DimEmployee table (which is my dimension table), and I am using the SCD
transformation to identify any changes and implement DimEmployee as Type 2.
DimEmployee:
Create table DimEmployee
(
    EmpKey int identity(1,1),
    EmpId int,
    Name varchar(100),
    Designation varchar(100),
    City varchar(100),
    Phone varchar(10),
    StartDate datetime,
    EndDate datetime
)

So before we start implementing any SCD, we first need to identify the attributes in the dimension table
for which we need to track changes. In this example I want to track changes for the Designation, City and
Phone attributes. I am expecting no change in the Name attribute.
You might have noticed that there are two columns, EmpId and EmpKey. Why are both columns needed in
the dimension table?
Ans:
EmpId: This is a business key, which uniquely identifies an employee across the entire data warehouse system.
EmpKey: This is a surrogate key, which uniquely identifies a record in the dimension table; it is also the
key used to identify historical records.

We also have two more columns, StartDate and EndDate. These two columns are used to track the timing
of changes; if EndDate is null, it means the record is the most recent record in the dimension table.
Steps to implement SCD in a data flow:
1. After we add the source (which in our case is the Excel file EmployeeFeed.xls), we add a Data
Conversion transformation to correct any data type conflicts.
2. Then we add the SCD transformation to the data flow, which opens the SCD wizard. Click Next on the
welcome screen.

3. On the Select a Dimension Table and Keys page, select your dimension table, which in this case is
DimEmployee, and map all the columns from the source Excel file to the destination DimEmployee table.
One important thing we do here is identify the business key, which in our case is EmpId. Then click Next.

4. On the Slowly Changing Dimension Columns page, we need to select the appropriate change type for
the dimension columns, and here we have three types:
Fixed Attribute --> No change expected.
Changing Attribute --> Changes are expected, but there is no need to record history; the same record
will be updated.
Historical Attribute --> If this attribute changes, the old record will be expired (by setting EndDate to the
current date) and a new record will be inserted with the new attribute value.
In our example, we don't expect any change to the Name attribute, so we select it as a Fixed Attribute,
and the rest (Phone, Designation and City) are selected as Historical Attributes. Once we are done,
click Next.

5. On the Historical Attribute Options screen, we have two options: we can use a flag column to show
which record is expired and which is most recent, or we can use StartDate and EndDate columns. In this
example we use the second option and select the StartDate and EndDate columns appropriately.

6. For all the other screens in the wizard, just click Next, and on the last screen click Finish.
That's it, we have implemented the SCD transformation. Your data flow should look like the one shown below.

As you may have noticed, we have two outputs from the SCD transformation: New Output and Historical
Attribute Output. Any new records that are not present in the dimension table will be redirected to New
Output, and all existing records with changes in historical attributes will be redirected to the Historical
Attribute output.
Running the data flow:
I have 9 records in my sample EmployeeFeed.xls file.
So when I run my data flow for the first time, all 9 records are redirected to New Output and inserted
into the DimEmployee table.
Next, I made some changes in EmployeeFeed.xls (the changes are marked in yellow): 4 records were
changed and 2 new records were added.
As you can see in the data flow, the two new records are redirected through the New Output pipeline and
the 4 changed records move through the Historical Attribute output. For those 4 records, we update the
EndDate of the existing rows to the current date and then insert them again with the new attribute
values, keeping EndDate as null, as shown below.
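For reference, the net effect of the Historical Attribute output on each changed employee is roughly the
following T-SQL, a minimal sketch assuming the incoming values are available as variables (the SCD wizard
generates equivalent logic for you through the components it adds to the data flow):

-- Expire the current version of the record
UPDATE DimEmployee
SET EndDate = GETDATE()
WHERE EmpId = @EmpId
  AND EndDate IS NULL;

-- Insert the new version with the changed attribute values
INSERT INTO DimEmployee (EmpId, Name, Designation, City, Phone, StartDate, EndDate)
VALUES (@EmpId, @Name, @Designation, @City, @Phone, GETDATE(), NULL);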


File System, For Loop and Foreach Loop Control Flow Tasks

In some ETL scenarios, when processing files, it is necessary to rename the
already processed files and move them to a different location. In SSIS you
can accomplish that in a single step using the File System Task. The example
I have prepared assumes the package will process a set of files using a
ForEach Loop container; then, for each file, the 'Rename' operation in the
File System Task will do both: rename and move the file.
Here are some screen shots and notes about the package:
First of all, the big picture. The control flow has a ForEach Loop Container
with a File System Task inside. Notice that the DataFlow task is empty and it
is intended to show where the real ETL work should go; but this can be

different or not required at all.

Then the details about the ForEach Loop container. Basically, this container is
configured to process all *.txt files in C:\Temp\Source folder, where all the
files 'to be processed' are expected to be.

Now the trick: a few variables, some of them using expressions:

The expressions are:


in FullSourcePathFileName:
@[User::SourcePath] + @[User::MyFileValue]
in FullArchivePathFileName:
@[User::ArchivePath] + SUBSTRING( @[User::MyFileValue] , 1 ,
FINDSTRING( @[User::MyFileValue],".",1) - 1 ) + "-" + (DT_STR, 2, 1252)
Month( @[System::StartTime] )+ (DT_STR, 4, 1252)
Year( @[System::StartTime] )+ SUBSTRING( @[User::MyFileValue] ,
FINDSTRING( @[User::MyFileValue],".",1) , LEN( @[User::MyFileValue] ) )
Notice that the SourcePath and ArchivePath variables hold only the origin and
destination paths of the files.
Note: Make sure you set the EvaluateAsExpression property of these variables to
TRUE.
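As a rough illustration (the file name and archive folder here are hypothetical), for a
file Sales.txt picked up in May 2012 the two expression-based variables would evaluate
to something like:

FullSourcePathFileName  : C:\Temp\Source\Sales.txt
FullArchivePathFileName : C:\Temp\Archive\Sales-52012.txt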
Lastly, the File System Task should be configured like this:

I am pretty sure there are different ways of accomplishing this simple task, but I
like this one because it does not require writing custom code and relies on
expressions.
SSIS For Loop Containers
The For Loop is one of two loop containers available in SSIS. In my opinion it
is easier to set up and use than the Foreach Loop, but it is just as useful. The
basic function of the For Loop is to loop over whatever tasks you put inside
the container a predetermined number of times, or until a condition is met.
The For Loop Container, as is true of all the containers in SSIS, supports
transactions by setting the TransactionOption property in the Properties pane of the
container to Required, or to Supported if a parent container or the package
itself is set to Required.
There are three expressions that control the number of times the
loop executes in the For Loop container.
1. The InitExpression is the first expression to be evaluated on the For
Loop and is only evaluated once at the beginning. This expression is
optional in the For Loop Container. It is evaluated before any work is
done inside the loop. Typically you use it to set the initial value for the
variable that will be used in the other expressions in the For Loop

Container. You can also use it to initialize a variable that might be used
in the workflow of the loop.
2. The EvalExpression is the second expression evaluated when the
loop first starts. This expression is not optional. It is also evaluated
before any work is performed inside the container, and then evaluated
at the beginning of each loop. This is the expression that determines if
the loop continues or terminates. If the expression entered evaluates
to TRUE, the loop executes again. If it evaluates to FALSE, the loop
ends. Make sure to pay particular attention to this expression. I will
admit that I have accidentally written an expression in the
EvalExpression that evaluates to False right away and terminated the
loop before any work was done, and it took me longer than it probably
should have to figure out that the EvalExpression was the reason why
it was wrong.
3. The AssignExpression is the last expression used in the For Loop. It is
used to change the value of the variable used in the EvalExpression.
This expression is evaluated for each pass through the loop as well, but
at the end of the workflow. This expression is optional.

Let's walk through setting up an example package. In this example we'll create a
loop that executes a given number of times.
Create a new package and add two variables to it, intStartVal and intEndVal.

Next add a For Loop Container to the package and open the editor. Assign
the following values for the expressions:
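The exact values from the screenshot are not reproduced in this text, but a configuration
that loops five times (matching the iterations shown further down) might look like this,
assuming intStartVal is initialized to 1 and intEndVal to 5:

InitExpression:   @intStartVal = 1
EvalExpression:   @intStartVal <= @intEndVal
AssignExpression: @intStartVal = @intStartVal + 1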

That is all the configuring that is required for the For Loop Container. Now
let's add a Script Task that will display a message box with the value of the

intStartVal variable as the loop updates the value of that variable. Here is the
code to do that:
Public Sub Main()
    ' Display the current value of the loop variable on each iteration.
    ' Note: intStartVal must be listed in the Script Task's ReadOnlyVariables
    ' property so that Dts.Variables can access it.
    MsgBox(Dts.Variables("intStartVal").Value)
    Dts.TaskResult = ScriptResults.Success
End Sub
Once that is done the package is ready to execute.

First Iteration

Second Iteration

Fifth Iteration

Complete

Adding Your Variables


When you use the Foreach Loop container to loop through a collection, you need to
define a variable that will provide a pointer to the current member, as you loop through
the collection. You can define that variable in advance or when you configure the
Foreach Loop container. In this case, I create the variable in advance so it's ready when I
need it. I assign the name JobTitle to the variable and configure it with the String data
type. For its value, I use a set of quotation marks to represent an empty string;
however, you can specify any initial value. If you're going to implement breakpoints and
set up watches to monitor variable values when you run the package, then you might

want to assign a meaningful value to the JobTitle variable to provide a better milepost
during the iterative process.
Next, I create a variable named JobTitles to hold the collection itself. You do not always
need to create a second variable. It depends on the collection type. In this case,
because I'll be retrieving data from a view, I need a variable to hold the result set
returned by my query, and that variable must be configured with the Object data type.
However, I don't need to assign an initial value to the variable. The value System.Object
is automatically inserted, as shown in Figure 1.

Figure 1: Adding the JobTitle and JobTitles variables to your SSIS package
Because I created the variables at a package scope, they'll be available to all
components in my control flow. I could have waited to create the JobTitle variable until
after I added the Foreach Loop container, then I could have configured the variable at
the scope of the container. I've seen it done both ways, and I've done it both ways. Keep
in mind, however, if you plan to use the variable outside of the Foreach Loop container,
make sure it has a package scope.

Configuring Your Control Flow


The first step in configuring the control flow is to add a connection manager to the
AdventureWorks2008R2 database. In this case, I create a connection to the database on
a local instance of SQL Server 2008 R2 and then name the connection manager
AdventureWorks2008R2.
Next, I add an Execute SQL task to my control flow in order to retrieve a list of job titles
from the vEmployee view. After I add the task, I open the task's editor and update the
value of the ResultSet property to Full result set. I use this setting because the task will
return a result set that contains data from the vEmployee view. I then specify the
AdventureWorks2008R2 connection manager in the Connection property, and assign the
following Transact-SQL statement to the SQLStatement property:

SELECT DISTINCT JobTitle


FROM HumanResources.vEmployee
WHERE JobTitle LIKE '%technician%'
My goal is to return a list of unique job titles that include the word technician. Figure 2
shows the General page of the Execute SQL Task editor after I add the Select statement.

Figure 2: Configuring the General page of the Execute SQL Task editor
Because the Execute SQL task has been set up to return a result set, you need some
place to put those results. That's where the JobTitles variable comes in. The task will
pass the result set to the variable as an ADO object, which is why the variable has to be
configured with the Object data type. The variable can then be used to provide those
results to the Foreach Loop container.
So the next step in configuring the Execute SQL task is to map the JobTitles variable to
the result set, which I do on the Result Set page of the Execute SQL Task editor, shown
in Figure 3.

Figure 3: Configuring the Result Set page of the Execute SQL Task editor
To create the mapping, I click Add and then specify the JobTitles variable in the first row
of the VariableName column. Notice in the figure that I include the User namespace,
followed by two colons. I then set the value in the Result Name column to 0.
Thats all you need to do to configure the Execute SQL task. The next step is to add a
Foreach Loop container and connect the precedence constraint from the Execute SQL
task to the container. Then you can configure the container. When doing so, you must
select an enumerator type. The enumerator type indicates the type of collection you're
working with, such as files in a folder or rows in a table. In this case, because the result
set is stored in the JobTitles variable as an ADO object, I select the Foreach ADO
enumerator, as shown in Figure 4.

Figure 4: Configuring the Collection page of the Foreach Loop editor


The ForeachADO enumerator lets you access rows of data in a variable configured with
the Object data type. So once I select the enumerator type, I select the JobTitles
variable from the ADO object source variable drop-down list. As for the
Enumeration mode option, I leave that at its default setting, Rows in the first table,
because there's only one table (with only one column).
After you configure the Foreach Loop container with the collection, you must create a
variable mapping that tells the container which variable to use to store the individual
member during each loop. You configure variable mappings on the Variable Mappings
page of the Foreach Loop editor, as shown in Figure 5.

Figure 5: Configuring the Variable Mappings page of the Foreach Loop editor
For my example, I create a mapping to the JobTitle variable. To do this, I select the
variable from the drop-down list in the first row of the Variable column, and set the
index to 0. I use 0 because my collection is taken from the first column of the result set
stored in the JobTitles variable. If there were more columns, the number would depend
on the column position. The positions are zero-based, so the first column
requires a 0 value in the Index column. If my result set included four columns and I was
using the third column, my Index value would be 2.
That's all there is to setting up the Foreach Loop container for this example. After I
complete the setup, I add a Data Flow task to the container. My control flow now looks
similar to the one shown in Figure 6.

Figure 6: Setting up the control flow in your SSIS package


When you run the package, the Execute SQL task will retrieve a list of technician-related
job titles and save that list to the JobTitles variable. The Foreach Loop container will
iterate through the values in the variable. During each loop, the current job title will be
saved to the JobTitle variable, and the container will execute any tasks or containers
within the Foreach Loop container. In this case, it's the Data Flow task. That means, for
each technician-related job title, the Data Flow task will be executed. So let's look at
how to configure that task.

Configuring Your Data Flow


As you probably know, you edit the Data Flow task on the Data Flow tab of SSIS
designer. For this example, I start by adding an OLE DB source component and opening
its editor, as shown in Figure 7.

Figure 7: Configuring the OLE DB source


The first thing to notice is that I specify the AdventureWorks2008R2 connection
manager in the OLE DB Connection manager drop-down list. I then select SQL Command
from the Data access mode drop-down list and add the following Select statement to
the SQL command text box:

SELECT FirstName, LastName, JobTitle


FROM HumanResources.vEmployee
WHERE JobTitle= ?
The statement retrieves employee data from the vEmployee view. Notice that the
WHERE clause includes a parameter placeholder (?) to indicate that a parameter value
should be passed into the clause. Because I've included the parameter, I must now map
it to a variable that can provide the necessary value to the WHERE condition. In this
case, that variable is JobTitle, which will contain the job title associated with the current
iteration of the Foreach Loop container.

NOTE: The query actually needs to retrieve data only from the FirstName and
LastName columns. However, I also included the JobTitle column simply as a
way to verify that the data populating the CSV files is the correct data.

To map the parameter to the variable, click the Parameters button on the Connection
Manager page of the OLE DB Source editor. The button is to the right of where you add
your Select statement. This launches the Set Query Parameters dialog box, shown in
Figure 8.

Figure 8: Mapping the JobTitle variable to the parameter in the SELECT


statement
All I need to do to map the variable to the parameter is to select the variable from the
drop-down list in the Variables column in the first row. Once this is done, I've completed
configuring the OLE DB source and can now add my next component to the data flow: a
Flat File destination.
There is nothing at all to configuring the destination itself. I simply add it to the data
flow and connect the data flow path from the OLE DB source to the destination
component. I then open the destination's editor and specify that a new Flat File
connection manager be created. This launches the Flat File Connection Manager editor,
shown in Figure 9.

Figure 9: Configuring a Flat File connection manager


I stick with all the default settings for the connection manager, except that I add the
following file path to the File Name text box: C:\DataFiles\JobTitle.csv. I then verify that
the columns are mapped correctly (on the Columns page). Once I've configured the
connection manager, my package is about ready to go, except for one important step.
The way the Flat File connection manager is currently configured, it will try to insert all
data into the JobTitle.csv file. That means, each time the Foreach Loop container runs
the Data Flow task, the job titles from the previous iteration will be overwritten, and the
file will contain only those technicians with the job title associated with the final loop.
However, one of the goals of this package is to create a file for each job title. That
means we need to modify the Flat File connection manager by creating a property
expression that changes the filename with each loop, based on the current value of the
JobTitle variable.
The easiest way to create the property expression is to open the Properties pane for the
connection manager and add a property expression, as shown in Figure 10.

Figure 10: Defining a property expression on your Flat File connection


manager
To create a unique file with each loop, I define a property expression for the
ConnectionString property. The expression itself concatenates the file path with the
JobTitle variable and the .csv file extension:

"C:\\DataFiles\\" + @[User::JobTitle] + ".csv"


Notice that I have to escape the backslashes in the file path by using an additional
backslash for each one. Now when I run the package, the current value in the JobTitle
variable provides the actual file name when that file is saved to the target folder, thus
creating a file for each job title. My data flow is now complete and looks similar to the
one shown in Figure 11.
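For instance (the job title here is just an illustrative value), when the JobTitle variable holds
Quality Assurance Technician, the property expression evaluates to:

C:\DataFiles\Quality Assurance Technician.csv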

Figure 11: Setting up the data flow in your SSIS package


If you've been creating your own SSIS package as you've been working through this
article, that package should now be ready to run. At this point, you might find it handy
to add a breakpoint to the control flow so you can monitor the JobTitle variable as its
value changes with each loop. If you do this, be sure to set the breakpoint on the Data
Flow task, not the Foreach Loop container itself. The container runs only once, but the
task runs multiple times, so that's where you'll see the variable value changing.

Bulk insert task:

The Bulk Insert task is used to copy large amounts of data into SQL Server tables from text files. For
example, imagine a data analyst in your organization provides a feed from a mainframe system to you
in the form of a text file and you need to import this into a SQL Server table. The easiest way to
accomplish this in an SSIS package is through the Bulk Insert task.
Configuring Bulk Insert Task
Drag the bulk insert task from the toolbox into the control flow window.

Double-click the Bulk Insert task to open the task editor, and click Connections in the pane on the left.

In the Connections tab, specify the OLE DB connection manager to connect to the destination SQL
Server database and the table into which data is inserted. Also, specify the Flat File connection manager
to access the source file, and select the column and row delimiters used in the flat file.

Click Options in the left pane of the editor, and select the code page of the file and the starting row
number (First row). Also specify the actions to perform on the destination table or view when the task
inserts the data. The options are to check constraints, enable identity inserts, keep nulls, fire triggers,
or lock the table.

On running the package, the data will be copied from the source to the destination. Bulk Insert
doesn't have an option to truncate and load; hence you must use an Execute SQL task to delete the
data already present in the table before loading the flat file data.
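For example, the Execute SQL task that runs before the Bulk Insert task might simply issue the following
statement (the table name is hypothetical):

TRUNCATE TABLE dbo.MainframeFeedStaging;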
It is an easy task to use and configure, but it has a few cons:
1. It only allows you to append data to the table; you cannot perform a truncate and load.
2. Only a flat file can be used as the source, not any other type of database.
3. Only SQL Server databases can be used as the destination; it doesn't support any other files or RDBMS systems.
4. A failure in the Bulk Insert task does not automatically roll back successfully loaded batches.
5. Only members of the sysadmin fixed server role can run a package that contains a Bulk Insert task.

Execute SQL Task:

RowCount for Execute SQL Task

Case
How do you get a rowcount when you execute an Insert, Update or Delete query with an Execute SQL

Task? I want to log the number of affected rows just like in a Data Flow.

Solution
The Transact-SQL function @@ROWCOUNT can help you here. It returns the number of rows affected
by the last statement.
1) Variable
Create an integer variable named 'NumberOfRecords' to store the number of affected rows in.

Right click to show variables

2) Execute SQL Task


Put an Execute SQL Task on your Control Flow. We are going to update some records.

Give it a suitable name.

3) Edit Execute SQL Statement


On the general tab, change the resultset to Single Row and select the right connection (this function only
works for SQL Server).

Resultset: Single Row

4) SQLStatement
Enter your query, but add the following text at the bottom of your query: SELECT @@ROWCOUNT as
NumberOfRecords; This query will return the number of affected rows in the column NumberOfRecords.
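Putting that together, the SQLStatement for the task might look like the following sketch (the table,
column, and filter are hypothetical):

UPDATE dbo.Customers
SET Status = 'Inactive'
WHERE LastOrderDate < '2010-01-01';

SELECT @@ROWCOUNT AS NumberOfRecords;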

See the @@ROWCOUNT function

5) Result Set
Go to the Result Set tab and change the Result Name to NumberOfRecords. This is the name of the
column. Select the variable of step 1 to store the value in.

Result Set

6) The Result
To show you the value of the variable with the number of affected records, I added a Script Task with a
simple message box. You can add your own logging, for example a Script Task that fires an event or an
Execute SQL Task that inserts a logging record.

The Result

Configurations:

Setting Up Your XML Configuration File


After you've set up your package, the first step in setting up the XML configuration
file is to enable package configurations. To do so, click the Package
Configurations option on the SSIS menu. This launches the Package Configuration
Organizer, shown in Figure 4.

Figure 4: The Package Configuration Organizer in SSIS


To enable package configurations on your package, select the Enable package
configurations checkbox. You can then add your package configurations to the
package. To do so, click Add to launch the Package Configuration wizard. When the
wizard appears, click Next to skip the Welcome screen. The Select Configuration
Type screen will appear, as shown in Figure 5.

Figure 5: The Select Configuration Type screen in the Package Configuration wizard
From the Configuration type drop-down list, select XML configuration file. You
can then choose to specify your configuration settings directly or specify a Windows
environment variable that stores the path and file names for the configuration file.
For this example, I selected the Specify configuration settings directly option
and specified the following path and file name:
C:\Projects\SsisConfigFiles\LoadPersonData.dtsConfig. The main thing to notice is
that the file should use the extension dtsConfig.
NOTE: If you specify an XML file that already exists, you'll be prompted whether to
use that file or whether to overwrite the file's existing settings and use the
package's current settings. If you use the file's settings, you'll skip the next screen;
otherwise, the wizard will proceed as if the file had not existed. Also, if you choose

to use an environment variable to store the path and file names, the wizard will not
create a configuration file and will again skip the next screen. Even if you use an
environment variable, you might want to create the file first and then select the
environment variable option afterwards.
The next screen in the wizard is Select Properties to Export. As the name
implies, this is where you select the properties for which you want package
configurations. In this case, I selected the Value property for the ConnectMngr
variable and the ServerName property for each of the two connections managers,
as shown in Figure 6.

Figure 6: Selecting properties in the Package Configuration wizard


Because I chose three properties, three package configurations will be created in
the XML file. You can choose as many properties as you want to add to your file.
On the next screen of the Package Configuration wizard, you provide a name for the
configuration and review the settings (shown in Figure 7).

Figure 7: The Completing the Wizard screen in the Package Configuration wizard
If you're satisfied with the settings, click Finish. The wizard will automatically
generate the XML configuration file and add the properties that you've specified.
The file will also be listed in the Package Configuration Organizer, as shown in Figure
8.

Figure 8: The XML package configuration as it's listed in the Package Configuration
Organizer
NOTE: When you add an XML configuration file, no values are displayed in the
Target Object and Target Property columns of the Package Configuration Organizer.
This is because XML configuration files support multiple package configurations.
You should also verify whether the XML package configuration file has been created
in the specified location. For this example, I added the file to the
C:\Projects\SsisConfigFiles\ folder. The file is automatically saved with the dtsConfig
extension. If you open the file in a text editor or browser, you should see the XML
necessary for a configuration file. Figure 9 shows the LoadPersonData.dtsConfig file
as it appears in Internet Explorer.

Figure 9: The XML in the LoadPersonData.dtsConfig file

As Figure 9 shows, the XML configuration file includes the


<DTSConfigurationHeading> element. The element contains the attributes and
their values that define when, who, and how the file was generated. The file also
includes one <Configuration> element for each package configuration. Each
<Configuration> element includes the attributes and their values necessary to
determine which property is being referenced. Within each <Configuration>
element is a nested <ConfiguredValue> element, which provides the property's
actual value.
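As a sketch of what one of those elements looks like (the property path and value here are
illustrative, not copied from the figure), a package configuration for a connection manager's
ServerName property would resemble:

<Configuration ConfiguredType="Property" Path="\Package.Connections[SourceServer].Properties[ServerName]" ValueType="String">
  <ConfiguredValue>ServerA</ConfiguredValue>
</Configuration>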
Notice that the property values are the same as those of the package itself. When you
first set up an XML configuration file, the current package value is used for each
property. You can, of course, change those values, as I demonstrate later in the
article.
Running Your SSIS Package
After you've created your XML configuration file, you're ready to run your package.
You run the package as you would any other SSIS package. However, because
package configurations have been enabled, the package will check for any settings
that have been predefined.
For the example I've been demonstrating here, the package will run as if nothing
has changed because, as stated above, the XML configuration file contains the
same values as the properties initially defined on the package. That means the
ConnectMngr variable will still have a value of Server A, and the connection
managers will still point to the same SQL Server computer. Figure 10 shows the
package after it ran without modifying the XML configuration file.

Figure 10: Running the LoadPersonData package with the default settings
As you would expect, the Server A data flow ran, but not the Server B data flow.
However, the advantage to using XML configuration files is that you can modify
property settings without modifying the package itself. When the package runs, it
checks the configuration file. If the file exists, it uses the values from the listed
properties. That means if I change the property values in the file, the package will
use those new values when it runs.
For instance, if I change the value of the ConnectMngr variable from Server A to
Server B, the package will use that value. As a result, the precedence constraint
that connects to the Server A Data Flow task will evaluate to False, and the
precedence constraint that connects to the Server B Data Flow task will evaluate to
True, and the Server B data flow will run. Figure 11 shows what happens if I change
the variable's value in the XML configuration file to Server B.

Figure 11: Running the Server B Data Flow task in the LoadPersonData SSIS package
As you would expect, the Server B Data Flow task ran, but not the Server A Data
Flow task. If I had changed the values of the ServerName properties for the
connection managers, my source and destination servers would also have been
different.
Clearly, XML configuration files offer a great deal of flexibility for supplying property
values to your packages. They are particularly handy when deploying your packages
to different environments. Server and instance names can be easily changed, as
can any other value. If you hard-code the path and file name of the XML
configuration file into the package, as I've done in this example, then you must
modify the package if that file location or name changes. You can get around this by
using a Windows environment variable, but thats not always a practical solution. In
addition, you can override the configuration path and file names by using the
/CONFIGURATION option with the DTExec utility.
Whatever approach you take, you'll find XML configuration files to be a useful tool
that can help streamline your development and deployment efforts. They're easy to
set up and maintain, and well worth the time it takes to learn how to use them and
how to implement them into your solutions.
Debugging and Logging

SQL Server Business Intelligence Development Studio (BIDS) provides several tools
you can use to troubleshoot the data flow of a SQL Server Integration Services
(SSIS) package. The tools let you sample a subset of data, capture data flow row
counts, view data as it passes through data paths, redirect data that generates
errors, and monitor package execution. You can use these tools for any package
that contains a data flow, regardless of the data's source or destination or what
transformations are being performed.
The better you understand the debugging tools, the more efficiently you can
troubleshoot your data flow. In this article, I demonstrate how each debugging tool
works. To do so, I set up a test environment that includes a comma-separated text
file, a table in a SQL Server database, and an SSIS package that retrieves data from
the text file and inserts it into the table. The text file contains data from the
Person.Person table in the AdventureWorks2008R2 database. To populate the file, I
ran the following bcp command:
bcp "SELECT TOP 10000 BusinessEntityID, FirstName, LastName FROM
AdventureWorks2008R2.Person.Person ORDER BY BusinessEntityID" queryout
C:\DataFiles\PersonData.txt -c -t, -S localhost\SqlSrv2008R2 -T
After I created the file, I manipulated the first row of data in the file by extending
the LastName value in the first row to a string greater than 50 characters. As you'll
see later in the article, I did this in order to introduce an error into the data flow so I
can demonstrate how to handle such errors.
Next I used the following Transact-SQL script to create the PersonName table in the
AdventureWorks2008R2 database:
USE AdventureWorks2008R2
GO
IF OBJECT_ID('dbo.PersonName') IS NOT NULL
DROP TABLE dbo.PersonName
GO
CREATE TABLE dbo.PersonName
(
NameID INT PRIMARY KEY,
FullName NVARCHAR(110) NOT NULL
)
After I set up the source and target, I created an SSIS package. Initially, I configured
the package with the following components:

A connection manager to the AdventureWorks2008R2 database.


A connection manager to the text file with the source data.

An Execute SQL task that truncates the PersonName table.

A Data Flow task that retrieves data from the text file, creates a derived
column, and inserts the data into the PersonName table.

Figure 1 shows the data flow components I added to the package, including those
components related to troubleshooting the data flow.

Figure 1: Setting up the data flow in the sample SSIS package


The data flow components specific to processing the Person data are the OLE DB
Source, Derived Column, and OLE DB Destination components. The Derived Column
transformation concatenates the first and last names into a single column named
FullName. The other components in the data flow are specific to debugging and are
discussed in detail in the rest of the article.
Working with a Data Sample

When you're developing an SSIS package that retrieves large quantities of data, it
can be helpful to work with only a subset of data until you've resolved any issues in
the data flow. SSIS provides two data flow components that let you work with a
randomly selected subset of data. The Row Sampling Transformation component
lets you specify the number of rows you want to include in your random data
sample, and the Percentage Sampling Transformation component lets you specify
the percentage of rows.
Both components support two data outputs: one for the sampled data and one for
the unsampled data. Each component also lets you specify a seed value so that the
samples are the same each time you run the package. (The seed value is tied to the
operating system's tick count.) When you don't specify a seed value, the data
sample is different each time you run the data flow.
If you refer back to Figure 1, you'll see that I added a Row Sampling Transformation
component right after the Flat File Source component. Figure 2 shows the Row
Sampling Transformation Editor. Notice that I configured the component to retrieve
1000 rows of sample data, but I did not specify a seed value.

Figure 2: Selecting a data sample from the data flow


If you want, you can name the outputs for the sample and non-sample data. In this
case, I've left the default names and used the Sampling Selected Output data path
to connect to the next component in the data flow. Now the data flow will include
only the random 1000 rows.
Verifying Row Counts

When data passes through a data flow, the SSIS design surface displays the number
of rows passing along each data path. The count changes as data moves through
the pipeline. After the package has finished executing, the number displayed is the
total number of rows that passed through the data path in the last buffer. If there
were multiple buffers, the final number would not provide an accurate count.
However, you can add a Row Count Transformation component to the data flow. The
transformation provides a final count that adds together the rows from all buffers
and stores the final count in a variable. This can be useful when you want to ensure
that a particular point in the data flow contains the number of rows you would
expect. You can then compare that number to the number of rows in your source or
destination.
To retrieve the row count from the variable, you can use whatever method you like.
For instance, you can create an event handler that captures the variable value and
saves it to a table in a SQL Server database. How you retrieve that value is up to
you. The trick is to use the Row Count Transformation component to capture the
total rows and save them to the variable.
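For example, an Execute SQL task (perhaps in an OnPostExecute event handler) could run a
parameterized insert such as the one below, with the two OLE DB parameter markers mapped to the
System::PackageName variable and the row-count variable; the logging table is hypothetical:

INSERT INTO dbo.PackageRowCountLog (PackageName, RowsProcessed, LogDate)
VALUES (?, ?, GETDATE());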
In my sample SSIS package, I created a string variable named RowCount, then, after
the Derived Column component, I added a Row Count Transformation component.
Figure 3 shows the component's editor. The only step I needed to take to configure
the editor was to add the variable name to the VariableName property.

Figure 3: Verifying the row counts of data passing along a data path
When the package runs, the final count from that part of the data flow will be saved
to the RowCount variable. I verified the RowCount value by adding a watch to the
control flow, but in an actual development environment, you'd probably want to
retrieve the value through a mechanism such as an event viewer, as mentioned
above, so you have a record you can maintain as long as necessary.
Adding Data Viewers to the Data Path
When troubleshooting data flow, it can be useful to view the actual data as it passes
through a data path. You can do this by adding one or more data viewers to your
data flow. SSIS supports several types of data viewers. The one most commonly
used is the grid data viewer, which displays the data in tabular format. However,
you can also create data viewers that display histograms, scatter plot charts, or
column charts. These types of data viewers tend to be useful for more analytical

types of data review, but for basic troubleshooting, the grid data viewer is often the
best place to start.
To create a grid data viewer, open the editor for the data path on which you want to
view the data, then go to the Data Viewers page, as shown in Figure 4.

Figure 4: Editing the properties of the data flow path


The Data Flow Path editor is where you add your data viewers, regardless of the
type. To add a data viewer, click the Add button to launch the Configure Data
Viewer dialog box, shown in Figure 5. Here you select the type of viewer you want
to create and provide a name for that viewer.

Figure 5: Creating a grid data viewer on a data path


After you select the Grid option from the Type list and provide a name, go to the
Grid tab, shown in Figure 6. This is where you determine what columns you want to
include in the grid. At this point, we're interested only in the BusinessEntityID and
FullName columns because those are the columns in our target table.

Figure 6: Configuring a grid data viewer


After you specify the columns to include in the grid, click OK. You'll be returned to
the Data Flow Path Editor. The new grid data viewer should now be displayed in the
Data Viewers list. In addition, a small icon is added next to the data path (shown
in Figure 1).
When you debug a package in which a data viewer has been defined, the package
will stop running at the viewer's data path and a window will appear and display the
data in that part of the data flow. Figure 7 shows the grid data viewer I configured
on my data flow.

Figure 7: Viewing sample data through a grid data viewer


Notice that the data viewer displays the BusinessEntityID and FullName values for
each row. You can scroll down the list, detach the viewer from the data flow, resume
the data flow, or copy the data to the clipboard. The data itself and the ultimate
outcome of the package are unaffected.
Configuring Error-Handling on the Components
Many data flow components let you specify how to handle data that might generate
an error. By default, if data causes an error, the component fails; however, you can
configure some components to redirect problem rows. For instance, if you refer back
to Figure 1, you'll see that the Flat File Source has an additional data path output,
which is red. You can use the red data path to capture any bad rows outputted by
the component, when the component is properly configured.

I connected the red data path to a Flat File Destination component so I can store
rows that generate errors in a text file. When you connect an error output to another
component, the Configure Error Output dialog box appears, as shown in Figure 8.
Notice that for each column, you can configure what action to take for either errors
or truncations. An error might be something like corrupt data or an incorrect data
type. A truncation occurs if a value is too long for the configured type. By default,
each column is configured to fail the component whether there is an error or
truncation.

Figure 8: Configuring a data flow component to redirect rows


You can override the default behavior by specifying that the row be redirected. In
this case, I chose to redirect all columns whether there was an error or truncation.
To do so, I changed the Error and Truncation options for each row and column to
Redirect row. Next, I configured the Flat File Destination component with a new
data source that points to a text file that will be used to capture the outputted rows,
if there are any errors or truncations. As you'll recall from earlier in the article, I
modified the last name in the first row of the source file by making the last name
too long. As a result, I would expect the first row to fail and be redirected to the new
error file.
When you configure the destination component and connection manager, you'll
notice that one column is created for the redirected row, one column for the
numeric error code, and one column for the identifier of the source column that
generates the error. When a row is redirected to the error output, it is saved to the
error file, along with the error number and column identifier. The values in the
redirected row are separated by commas, but treated as one value.

Monitoring Package Execution


The final tools for troubleshooting the data flow are related to the package
execution and SSIS design surface. When a package is running, you can watch the
data flow to see what is happening with each component. Row counts are displayed
next to the data paths and the components change colors as they're being
executed. By observing these colors, you can see the state of execution:

White. Component has not yet been executed.


Yellow. Component is currently extracting, transforming, or loading data.

Green. Component has completed its operation.

Red. Component generated errors and package execution stopped.

Of course, if a component turns red, you have a problem. But sometimes a


component will turn yellow and hang there. In which case, you still have a problem.
However, if everything is running fine, the components will first turn yellow and
then green, as shown in Figure 9.

Figure 9: Viewing the data flow progress on the design surface


Notice that the number of rows that passed through the data paths during the last
buffer show up on the design surface. As you can see, one row has been redirected
to the error file. Also, there are 9,999 rows in the data path that leads to the Row
Sampling transformation, but only 1,000 rows after the transformation.
If an execution is not successful (red or hanging yellow), you should refer to the
Progress tab for information about the package execution. There you can find
details about each component and the data that is flowing through those
components. Figure 10 shows the Progress tab after I finished running my package.

Figure 10: Viewing the Progress tab during package execution


Notice that the Progress tab shows details about the Data Flow task and its data
pipeline. The details shown here are only part of the displayed information. You
need to scroll down to view the rest. However, as you can see, there are several
warning messages, along with all the information messages. In this case, the
warning messages indicate that the unsampled data is not being used, as we
already knew. But some warnings can be useful information to have. In addition, the
Progress tab also displays error messages, along with all the other events that are
fired during execution.
The Data Flow Debugging Tools

You might not need to use all the tools that SSIS provides for debugging your data
flow, but whatever tools you do implement can prove quite useful when trying to
troubleshoot an issue. By working with data samples, monitoring row counts, using
data viewers, configuring error-handling, and monitoring package execution, you
should be able to pinpoint where any problems might exist in your data flow. From
there, you can take the steps necessary to address those problems. Without the
SSIS troubleshooting tools, locating the source of the problem can take an
inordinate amount of time. The effort you put in now to learn how to use these tools
and take advantage of their functionality can pay off big every time you run an SSIS
package.
Steps to configure logging
Open the package in Business Intelligence Development Studio (BIDS) and make sure you
are in design mode. When you are in the Control Flow, right-click (do not right-click
on the control flow tasks) and select Logging from the drop-down menu displayed
(picture below).

A dialog box, Configure SSIS Logs, is displayed. In the left-hand pane, a tree
view is displayed. Select the package by selecting the check box corresponding to it
(picture below). You can check individual tasks also.

Upon selecting the package or task, you can then configure logging through the
available logging providers in the drop-down list, as shown below. You can add
multiple logs of the same type and/or of another type. In our example we will select
only one log provider, the SSIS log provider for Text Files. After selecting the log
provider, click the Add button.

Once the log type is selected and added, the dialog box looks like the picture below.
Choose the log by selecting the check box to the left of it, then go to the
Configuration column to configure the location of the log file, which in our example is
a text file.

When you go to the Configuration column, a drop-down list appears in which a
<new connection> entry is listed. Choose that, and it opens a small window similar
to the one shown below.

Choose Create file as the usage type and click the Browse button. This opens a dialog
box in which we navigate to the directory where the SSIS package log file will be
created. I am choosing the default Log directory of the instance here (picture
below).

After choosing the location and the name of the file to be used, click the Open button
in the current dialog box to return to the previous dialog, then click OK to
configure the file location. Now we are all set, except for selecting the events that
will be logged to this log file. To select the events, switch to the Details tab as shown
below and choose the events which need to be logged into the log file. Choosing the
events selectively is important, since we do not want too much information written
into the log file, making it difficult to find information when needed. I always
choose the OnError and OnTaskFailed events for every task and some additional events
in the case of Data Flow tasks.

Continue to click a series of OK buttons to have the logging configured.
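Logging configured this way is saved as part of the package. You can also request logging when the package is executed with dtexec. Below is a minimal sketch, assuming a hypothetical package path and a file connection manager named PackageLog already defined in the package; the text-file provider ProgID shown is the commonly documented one and worth confirming for your SSIS version.

REM Write log events to the console while the package runs.
dtexec /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /ConsoleLog

REM Attach the text-file log provider to the package's "PackageLog" file connection manager.
dtexec /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /Logger "DTS.LogProviderTextFile.1";PackageLog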

SSIS Package Deployment Utility


For deploying packages created by SSIS developers, Business Intelligence
Development Studio provides the Package Deployment Utility. This utility creates a
deployment bundle by means of which you can deploy your package to:
1. The file system
2. SQL Server (the msdb database)
Just open the properties of your project and go to the Deployment Utility tab.

Set the "Create Deployment Utility" as "True" and specify the "Deployment Path".
As soon as you build your project deployment utility is created in the above specified
folder with the package file. The file type of Deployment Utility is "Integration Services
Deployment Manifest". The extension of the deployment package is
"*.SSISDeploymentManifest".
When you run this manifest file. The package deployment wizard is started which helps
in deploying the package.

As discussed above, you can also specify the deployment destination for the SSIS
package.

If you choose to install to the file system, you just have to specify the destination
folder and run the wizard. If you choose instead to install to a SQL Server instance,
you have to specify the SQL Server instance on which you want to install the
package.
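If you prefer to skip the wizard, the dtutil command-line utility can copy a package to either destination. A minimal sketch, assuming hypothetical file paths and package names, run from a command prompt on a machine with SSIS installed:

REM Copy the package file into the msdb database of the local default instance.
dtutil /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /COPY SQL;LoadCustomers

REM Or copy the same package to a file system folder instead.
dtutil /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /COPY FILE;"D:\DeployedPackages\LoadCustomers.dtsx"

Either command leaves the original package file untouched; dtutil simply copies it to the new location.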

Security:

ProtectionLevel is an SSIS package property that is used to specify how sensitive
information is saved within the package and also whether to encrypt the package or the
sensitive portions of the package. The classic example of sensitive information would be a
password. Each SSIS component designates that an attribute is sensitive by including
Sensitive="1" in the package XML; e.g. an OLE DB Connection Manager specifies that the
database password is a sensitive attribute as follows:
<DTS:Password DTS:Name="Password" Sensitive="1">

When the package is saved, any property that is tagged with Sensitive="1" gets handled per
the ProtectionLevel property setting in the SSIS package. The ProtectionLevel property can
be selected from the following list of available options (click anywhere in the design area of
the Control Flow tab in the SSIS designer to show the package properties):

DontSaveSensitive
EncryptSensitiveWithUserKey
EncryptSensitiveWithPassword
EncryptAllWithPassword
EncryptAllWithUserKey
ServerStorage

To show the effect of the ProtectionLevel property, add an OLE DB Connection Manager to
an SSIS package:

The above connection manager is for a SQL Server database that uses SQL Server
authentication; the password gives the SSIS package some sensitive information that must
be handled per the ProtectionLevel package property.
Now let's discuss each ProtectionLevel setting using an SSIS package with the above OLE
DB Connection Manager added to it.

DontSaveSensitive
When you specify DontSaveSensitive as the ProtectionLevel, any sensitive information is
simply not written out to the package XML file when you save the package. This could be
useful when you want to make sure that anything sensitive is excluded from the package
before sending it to someone. After saving the package using this setting, when you open it
up and edit the OLE DB Connection Manager, the password is blank even though the Save
my password checkbox is checked:
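Because nothing sensitive is stored in the package, you typically supply the password again at run time, for example through a package configuration or dtexec's /Set option. Below is a sketch, assuming a hypothetical package path, a connection manager named SourceDB, and that its Password property can be set this way; verify the property path for your connection manager.

REM Provide the connection password at run time because it was not saved with the package.
dtexec /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /Set \Package.Connections[SourceDB].Properties[Password];MyP@ssw0rd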

EncryptSensitiveWithUserKey
EncryptSensitiveWithUserKey encrypts sensitive information based on the credentials of the
user who created the package; e.g. the password in the package XML would look like the
following (actual text below is abbreviated to fit the width of the article):
<DTS:PASSWORD Sensitive="1" DTS:Name="Password"
Encrypted="1">AQAAANCMnd8BFdERjHoAwE/Cl+...</DTS:PASSWORD>

Note that the package XML for the password has the attribute Encrypted="1"; when the
user who created the SSIS package opens it the above text is decrypted automatically in
order to connect to the database. This allows the sensitive information to be stored in the
SSIS package but anyone looking at the package XML will not be able to decrypt the text
and see the password.
There is a limitation with this setting; if another user (i.e. a different user than the one who
created the package and saved it) opens the package the following error will be displayed:

If the user edits the OLE DB Connection Manager, the password will be blank. It is important
to note that EncryptSensitiveWithUserKey is the default value for the ProtectionLevel
property. During development this setting may work okay. However, you do not want to
deploy an SSIS package with this setting, as only the user who created it will be able to
execute it.

EncryptSensitiveWithPassword
The EncryptSensitiveWithPassword setting for the ProtectionLevel property requires that you
specify a password in the package, and that password will be used to encrypt and decrypt
the sensitive information in the package. To fill in the package password, click on the button
in the PackagePassword field of the package properties as shown below:

You will be prompted to enter the password and confirm it. When opening a package with a
ProtectionLevel of EncryptSensitiveWithPassword, you will be prompted to enter the
password as shown below:

The EncryptSensitiveWithPassword setting for the ProtectionLevel property overcomes the
limitation of the EncryptSensitiveWithUserKey setting, allowing any user to open the
package as long as they have the password.
When you execute a package with this setting using DTEXEC, you can specify the password
on the command line using the /Decrypt password command line argument.
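For example, here is a sketch of such a command line, with a hypothetical package path and password:

REM The /Decrypt argument supplies the package password so the sensitive values can be decrypted.
dtexec /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /Decrypt MyP@ssw0rd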

EncryptAllWithPassword
The EncryptAllWithPassword setting for the ProtectionLevel property allows you to encrypt
the entire contents of the SSIS package with your specified password. You specify the
package password in the PackagePassword property, same as with the
EncryptSensitiveWithPassword setting. After saving the package you can view the package
XML as shown below:

Note that the entire contents of the package are encrypted and the encrypted text is shown in
the CipherValue element. This setting completely hides the contents of the package. When
you open the package, you will be prompted for the password. If you lose the password,
there is no way to retrieve the package contents, so keep that in mind.
When you execute a package with this setting using DTEXEC, you can specify the password
on the command line using the /Decrypt password command line argument.
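If you need to change a package's ProtectionLevel outside the designer, dtutil can re-save it with a different setting. A sketch with hypothetical paths and password, assuming that numeric level 3 corresponds to EncryptAllWithPassword (check the level codes in the dtutil documentation):

REM Re-save the package so that its entire contents are encrypted with the supplied password.
dtutil /FILE "C:\SSIS\Packages\LoadCustomers.dtsx" /ENCRYPT FILE;"C:\SSIS\Packages\LoadCustomers.dtsx";3;MyP@ssw0rd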

EncryptAllWithUserKey

The EncryptAllWithUserKey setting for the ProtectionLevel property allows you to encrypt
the entire contents of the SSIS package using the user key. This means that only the
user who created the package will be able to open it, view and/or modify it, and run it. After
saving a package with this setting, the package XML will look similar to this:

Note that the entire contents of the package are encrypted and contained in the Encrypted
element.

ServerStorage
The ServerStorage setting for the ProtectionLevel property allows the package to retain all
sensitive information when you save the package to SQL Server. SSIS packages saved
to SQL Server are stored in the MSDB database. This setting assumes that you can adequately
secure the MSDB database, and that it is therefore acceptable to keep sensitive information
in the package in unencrypted form.
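A package stored in msdb this way is still easy to execute, because dtexec can load it directly from SQL Server. A sketch with hypothetical package and server names:

REM Run a package that is stored in the msdb database of the given server;
REM with ServerStorage, access to the package is controlled by securing msdb.
dtexec /SQL LoadCustomers /Server localhost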

Scheduling:

An SSIS package can be scheduled as a SQL Server Agent job. Here is a quick note on how
to do this.
First, create a new job from the SQL Server Agent menu.

Create New Step.

Select the step type SQL Server Integration Services Packages, set the package source to
File system, and specify the package path.
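Incidentally, an SSIS job step of this type is essentially a dtexec invocation; the step's Command line tab shows the exact arguments SQL Server Agent will use. A rough sketch of a comparable command, with a hypothetical package path:

REM Roughly what the job step runs: execute the package from the file system,
REM reporting errors (E) back to the job history.
dtexec /FILE "D:\SSIS\Packages\LoadCustomers.dtsx" /Reporting E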

Now click OK, which brings you to the following screen.

On the next screen you can create and configure the desired schedule.

As you can see, this is a very easy process. Let me know if you have any further questions.
