
Problem

When preparing for a SQL Server interview, it is helpful to understand what questions may be asked related to SSIS. In this tip series, I will try to cover as much as I can to help you prepare.

Solution
What is SQL Server Integration Services (SSIS)? SQL Server Integration Services (SSIS) is a component of SQL Server 2005 and later versions. SSIS is an enterprise scale ETL (Extraction, Transformation and Load) tool which allows you to develop data integration and workflow solutions. Apart from data integration, SSIS can be used to define workflows to automate tasks such as updating multidimensional cubes and performing maintenance on SQL Server databases.

How does SSIS differ from DTS? SSIS is the successor to DTS (Data Transformation Services) and has been completely re-written from scratch to overcome the limitations of DTS, which was available in SQL Server 2000 and earlier versions. A significant improvement is the segregation of the control/work flow from the data flow and the use of a buffer/memory oriented architecture for data flows and transformations, which improves performance.

What is the Control Flow? When you start working with SSIS, you first create a package which is nothing but a collection of tasks or package components. The control flow allows you to order the workflow, so you can ensure tasks/components get executed in the appropriate order.

What is the Data Flow Engine? The Data Flow Engine, also called the SSIS pipeline engine, is responsible for managing the flow of data from the source to the destination and performing transformations (lookups, data cleansing, etc.). The data flow uses a memory-oriented architecture based on buffers, which allows it to execute extremely fast. This means the SSIS pipeline engine pulls data from the source, stores it in buffers (in-memory), performs the requested transformations in the buffers and writes the results to the destination. The benefit is that the transformations happen in memory, which is the fastest approach, and in most cases the data does not need to be staged for transformation.

What is a Transformation? A transformation simply means bringing the data into a desired format. For example, you are pulling data from the source and want to ensure only distinct records are written to the destination, so duplicates are removed. Another example is if you have master/reference data and want to pull only related data from the source, in which case you need some sort of lookup. There are around 30 transformations available and this can be extended further with custom built components if needed.

What is a Task? A task is very much like a method in a programming language, which represents or carries out an individual unit of work. There are broadly two categories of tasks in SSIS: Control Flow tasks and Database Maintenance tasks. All Control Flow tasks are operational in nature except Data Flow tasks. Although there are around 30 control flow tasks which you can use in your package, you can also develop your own custom tasks with your choice of .NET programming language.

What is a Precedence Constraint and what types of Precedence Constraint are there? SSIS allows you to place as many tasks as you want in the control flow. You can connect all these tasks using connectors called Precedence Constraints. Precedence Constraints allow you to define the logical sequence of tasks in the order they should be executed. You can also specify a condition to be evaluated before the next task in the flow is executed; the condition can be a constraint, an expression or both. These are the types of precedence constraints:
o Success - the next task will be executed only when the last task completed successfully
o Failure - the next task will be executed only when the last task failed
o Completion - the next task will be executed no matter whether the last task completed or failed
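As a quick illustration, here is a minimal sketch using the SSIS object model (Microsoft.SqlServer.Dts.Runtime) that connects two Execute SQL Tasks with a precedence constraint; the task contents are omitted and the expression shown is a hypothetical placeholder.

using Microsoft.SqlServer.Dts.Runtime;

class PrecedenceConstraintDemo
{
    static void Main()
    {
        Package package = new Package();

        // Two Execute SQL Tasks ("STOCK:SQLTask" is the stock moniker for that task type)
        Executable loadTask  = package.Executables.Add("STOCK:SQLTask");
        Executable auditTask = package.Executables.Add("STOCK:SQLTask");

        // Connect them: auditTask runs only after loadTask completes successfully
        PrecedenceConstraint pc = package.PrecedenceConstraints.Add(loadTask, auditTask);
        pc.EvalOp = DTSPrecedenceEvalOp.Constraint;   // evaluate the execution result only
        pc.Value  = DTSExecResult.Success;            // Success, Failure or Completion

        // To combine the constraint with an expression instead:
        // pc.EvalOp = DTSPrecedenceEvalOp.ExpressionAndConstraint;
        // pc.Expression = "@[User::RowCount] > 0";   // hypothetical variable
    }
}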

What is a container and how many types of containers are there? A container is a logical grouping of tasks which allows you to manage the scope of the tasks together. These are the types of containers in SSIS:
o Sequence Container - used for grouping logically related tasks together
o For Loop Container - used when you want a repeating flow in a package
o For Each Loop Container - used for enumerating each object in a collection, for example a record set or a list of files
Apart from the above mentioned containers, there is one more container called the Task Host Container which is not visible from the IDE, but every task is contained in it (it is the default container for all the tasks).

What are variables and what is variable scope? A variable is used to store values. There are basically two types of variables: System Variables (like ErrorCode, ErrorDescription, PackageName, etc.) whose values you can use but cannot change, and User Variables which you create, assign values to and read as needed. A variable can hold a value of the data type you chose when you defined the variable. Variables can have different scopes depending on where they are defined. For example, you can have package level variables which are accessible to all the tasks in the package, and there can also be container level variables which are accessible only to those tasks that are within the container.
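To make variable scope concrete, here is a small, hedged sketch against the SSIS object model; the variable names and values are made up for the example.

using Microsoft.SqlServer.Dts.Runtime;

class VariableScopeDemo
{
    static void Main()
    {
        Package package = new Package();

        // Package-scoped user variable: visible to every task in the package
        package.Variables.Add("BatchId", false, "User", 0);

        // Container-scoped variable: visible only to tasks inside this Sequence container
        Sequence group = (Sequence)package.Executables.Add("STOCK:SEQUENCE");
        group.Variables.Add("LoopCounter", false, "User", 0);

        // System variables can be read but not changed
        string packageName = package.Variables["System::PackageName"].Value.ToString();

        // User variables are read and written through their qualified name
        package.Variables["User::BatchId"].Value = 42;
    }
}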

Problem
When you are preparing for an SSIS interview you need to understand what questions could be asked in the interview. In this tip series, I will try to cover as much as I can to help you prepare for an SSIS interview. In this tip we will cover SSIS basics and SSIS event logging.

Solution
What are SSIS Connection Managers? When we talk about integrating data, we are actually pulling data from different sources and writing it to a destination. But how do you get connected to the source and destination systems? This is where connection managers come into the picture. A connection manager represents a connection to a system and includes the data provider information, the server name, database name, authentication mechanism, etc. For more information check out the SQL Server Integration Services (SSIS) Connection Managers and Connection Managers in SQL Server 2005 Integration Services SSIS tips.

What is the RetainSameConnection property and what is its impact? Whenever a task uses a connection manager to connect to a source or destination database, a connection is opened and closed with the execution of that task. Sometimes you might need to open a connection, execute multiple tasks and close it at the end of the execution. This is where the RetainSameConnection property of the connection manager can help. When you set this property to TRUE, the connection will be opened the first time it is used and will remain open until execution of the package completes.
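The sketch below shows one way to set this up from the SSIS object model; the connection string is illustrative, and accessing RetainSameConnection through the generic Properties collection is an assumption rather than a documented recipe.

using Microsoft.SqlServer.Dts.Runtime;

class RetainSameConnectionDemo
{
    static void Main()
    {
        Package package = new Package();

        // Add an OLE DB connection manager and point it at the source database
        ConnectionManager cm = package.Connections.Add("OLEDB");
        cm.Name = "SourceDB";
        cm.ConnectionString =
            "Data Source=localhost;Initial Catalog=AdventureWorks;" +
            "Provider=SQLNCLI10.1;Integrated Security=SSPI;";   // illustrative values

        // Keep one physical connection open across all tasks that use this connection manager
        cm.Properties["RetainSameConnection"].SetValue(cm, true);
    }
}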

What are source and destination adapters? A source adapter indicates a source in the Data Flow to pull data from. The source adapter uses a connection manager to connect to the source and along with it you can also specify the query method and query to pull data from the source. Similarly, the destination adapter indicates a destination in the Data Flow to write data to. Like the source adapter, the destination adapter uses a connection manager to connect to a target system and along with that you specify the target table and the writing mode, i.e. write one row at a time or do a bulk insert, as well as several other properties. Please note, the source and destination adapters can both use the same connection manager if you are reading and writing to the same database.

What is the Data Path and how is it different from a Precedence Constraint? A Data Path is used in a Data Flow task to connect the different components of a Data Flow and show the transition of the data from one component to another. A data path contains the meta information of the data flowing through it, such as the columns, data types, size, etc. As for the difference between the two: a data path is used in the data flow and shows the flow of data, whereas a precedence constraint is used in the control flow and shows the flow of control or the transition from one task to another.

What is the Data Viewer utility and what is it used for? The data viewer utility is used in Business Intelligence Development Studio during development or when troubleshooting an SSIS Package. The data viewer utility is placed on a data path to see what data is flowing through that specific data path during execution. The data viewer utility displays rows from a single buffer at a time, so you can click on the next or previous icons to go forward and backward through the data. Check out the Data Viewer enhancements in SQL Server Denali.

What is an SSIS breakpoint? How do you configure it? How do you disable or delete it? A breakpoint allows you to pause the execution of the package in Business Intelligence Development Studio during development or when troubleshooting an SSIS Package. You can right click on a task in the control flow, click on the Edit Breakpoints menu and, from the Set Breakpoints window, specify when you want execution to be halted/paused, for example on the OnPreExecute, OnPostExecute or OnError events, etc. To toggle a breakpoint, delete all breakpoints or disable all breakpoints, go to the Debug menu and click on the respective menu item. You can even specify different conditions for hitting the breakpoint. To learn more about breakpoints, refer to Breakpoints in SQL Server 2005 Integration Services SSIS.

What is SSIS event logging? Like any other modern programming language, SSIS raises different events during the package execution life cycle. You can enable logging of these events to trace the execution of your SSIS package and its tasks, and you can also write your own custom messages to the log. You can enable event logging at the package level as well as at the task level, and you can choose which specific events of a task or a package should be logged. This is essential when you are troubleshooting your package and trying to understand a performance problem or the root cause of a failure. Check out this tip about Custom Logging in SQL Server Integration Services SSIS.

What are the different SSIS log providers? There are several log providers to which SSIS can write execution log data:
o SSIS log provider for Text files
o SSIS log provider for Windows Event Log
o SSIS log provider for XML files
o SSIS log provider for SQL Profiler
o SSIS log provider for SQL Server, which writes the data to the msdb..sysdtslog90 or msdb..sysssislog table depending on the SQL Server version.

How do you enable SSIS event logging? SSIS provides a granular level of control in deciding what to log and where to log it. To enable event logging for an SSIS Package, right click in the control flow area of the package and click on Logging. In the Configure SSIS Logs window you will notice all the tasks of the package are listed on the left side in a tree view, where you can choose specifically for which tasks you want to enable logging. On the right side you will notice two tabs: on the Providers and Logs tab you specify where you want to write the logs (you can write to one or more log providers together), and on the Details tab you specify which events you want to log for the selected task. Please note, enabling event logging is immensely helpful when you are troubleshooting a package, but it also incurs additional overhead on SSIS in order to log the events and information. Hence you should only enable event logging when needed and only choose the events which you want to log; avoid logging all the events unnecessarily.
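The dialog steps above have an object model equivalent; the rough sketch below enables logging for a package and filters the events to capture. The log provider ProgID varies by SSIS version and the SelectedLogProviders call is an assumption, so treat the details as illustrative.

using Microsoft.SqlServer.Dts.Runtime;

class EnableLoggingDemo
{
    static void Main()
    {
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\SSIS\LoadCustomers.dtsx", null);   // hypothetical path

        // Add a text file log provider (the ".2" ProgIDs are the SSIS 2008 versions)
        LogProvider provider = package.LogProviders.Add("DTS.LogProviderTextFile.2");
        provider.ConfigString = "LogFileConnection";   // name of a File connection manager in the package

        // Turn logging on and choose the provider and the events to capture
        package.LoggingMode = DTSLoggingMode.Enabled;
        package.LoggingOptions.SelectedLogProviders.Add(provider);   // assumed overload
        package.LoggingOptions.EventFilterKind = DTSEventFilterKind.Inclusion;
        package.LoggingOptions.EventFilter = new string[] { "OnError", "OnWarning", "OnPostExecute" };

        app.SaveToXml(@"C:\SSIS\LoadCustomers.dtsx", package, null);
    }
}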

What is the LoggingMode property? SSIS packages and all of their associated tasks and components have a property called LoggingMode. This property accepts three possible values: Enabled - to enable logging for that component, Disabled - to disable logging for that component, and UseParentSetting - to use the parent's setting to decide whether or not to log the data.

Problem
When preparing for your next SSIS interview be sure to understand what questions could be asked in the interview. In this tip series, I will try to cover as much as I can to help you prepare better for an SSIS interview. This tip covers SQL Server Integration Services interview questions on transactions, event handling and validation.

Solution
What is the transaction support feature in SSIS? When you execute a package, every task of the package executes in its own transaction. What if you want to execute two or more tasks in a single transaction? This is where the transaction support feature helps. You can group all your logically related tasks into a single group. Next you set the transaction property appropriately to enable a transaction so that all the tasks of the group run in a single transaction. This way you can ensure either all of the tasks complete successfully, or if any of them fails, the transaction gets rolled back.

What properties do you need to configure in order to use the transaction feature in SSIS? Suppose you want to execute 5 tasks in a single transaction; in this case you can place all 5 tasks in a Sequence Container and set the TransactionOption and IsolationLevel properties appropriately. The TransactionOption property expects one of these three values:
o Supported - the container/task does not create a separate transaction, but if the parent object has already initiated a transaction then it participates in it
o Required - the container/task creates a new transaction irrespective of any transaction initiated by the parent object
o NotSupported - the container/task neither creates a transaction nor participates in any transaction initiated by the parent object

The IsolationLevel property dictates how two or more transactions maintain consistency and concurrency when they are running in parallel. To learn more about Transactions and Isolation Levels, refer to this tip.
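As a rough object model sketch of the same idea (task contents omitted), grouping two tasks in a Sequence container that requires a transaction might look like this:

using System.Data;
using Microsoft.SqlServer.Dts.Runtime;

class TransactionDemo
{
    static void Main()
    {
        Package package = new Package();

        // Group the logically related tasks in a Sequence container
        Sequence group = (Sequence)package.Executables.Add("STOCK:SEQUENCE");
        group.Executables.Add("STOCK:SQLTask");   // e.g. truncate the target table
        group.Executables.Add("STOCK:SQLTask");   // e.g. reload the target table

        // Run everything inside the container in one transaction (requires the MS DTC service)
        group.TransactionOption = DTSTransactionOption.Required;
        group.IsolationLevel = IsolationLevel.Serializable;   // Serializable is the default level
    }
}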

When I enabled transactions in an SSIS package, it failed with this exception: "The Transaction Manager is not available. The DTC transaction failed to start." What caused this exception and how can it be fixed? SSIS uses the MS DTC (Microsoft Distributed Transaction Coordinator) Windows Service for transaction support. As such, you need to ensure this service is running on the machine where you are actually executing the SSIS packages or the package execution will fail with the exception message as indicated in this question.

What is event handling in SSIS? Like many other programming languages, SSIS and its components raise different events during the execution of the code. You can write an event handler to capture an event and handle it in a few different ways. For example, consider you have a data flow task and before execution of this data flow task you want to make some environmental changes, such as creating a table to write data into or deleting/truncating the table you want to write to. Along the same lines, after execution of the data flow task you want to clean up some staging tables. In this circumstance you can write an event handler for the OnPreExecute event of the data flow task, which gets executed before the actual execution of the data flow. Similarly, you can write an event handler for the OnPostExecute event of the data flow task, which gets executed after the execution of the actual data flow task. Please note, not all tasks raise the same events; there might be some events related to a specific task that you can use with one object and not with others.

How do you write an event handler? First, open your SSIS package in Business Intelligence Development Studio (BIDS) and click on the Event Handlers tab. Next, select the executable/task from the left side combo-box and then select the event you want to write the handler for in the right side combo box. Finally, click on the hyperlink to create the event handler. So far you have only created the event handler, you have not specified any sort of action. For that, simply drag the required task from the toolbox onto the event handler designer surface and configure it appropriately. To learn more about event handling, click here.
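For completeness, an event handler can also be attached through the object model. In this hedged sketch an OnPostExecute handler containing an Execute SQL Task is added to a data flow task; the stock moniker for the data flow task and the cleanup statement are assumptions.

using Microsoft.SqlServer.Dts.Runtime;

class EventHandlerDemo
{
    static void Main()
    {
        Package package = new Package();

        // The data flow task whose events we want to handle ("STOCK:PipelineTask" is assumed here)
        TaskHost dataFlow = (TaskHost)package.Executables.Add("STOCK:PipelineTask");

        // Attach an OnPostExecute handler and drop an Execute SQL Task inside it
        DtsEventHandler onPostExecute = (DtsEventHandler)dataFlow.EventHandlers.Add("OnPostExecute");
        TaskHost cleanup = (TaskHost)onPostExecute.Executables.Add("STOCK:SQLTask");
        cleanup.Properties["SqlStatementSource"].SetValue(cleanup, "TRUNCATE TABLE dbo.StagingOrders;");
    }
}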

What is the DisableEventHandlers property used for? Consider you have a task or package with several event handlers, but for some reason you do not want event handlers to be called. One simple solution is to delete all of the event handlers, but that would not be viable if you want to use them in the future. This is where you can use the DisableEventHandlers property. You can set this property to TRUE and all event handlers will be disabled. Please note with this property you simply disable the event handlers and you are not actually removing them. This means you can set this value to FALSE and the event handlers will once again be executed.
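A short sketch of toggling the property on a loaded package (the file path is hypothetical):

using Microsoft.SqlServer.Dts.Runtime;

class DisableEventHandlersDemo
{
    static void Main()
    {
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\SSIS\LoadCustomers.dtsx", null);   // hypothetical path

        // Keep the event handlers in the package but stop them from firing for this run
        package.DisableEventHandlers = true;
        package.Execute();

        // Setting it back to FALSE lets the handlers fire again on later runs
        package.DisableEventHandlers = false;
    }
}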

What is SSIS validation? SSIS validates the package and all of its tasks to ensure they have been configured correctly and that, with the given set of configurations and values, all the tasks and the package will execute successfully. In other words, during the validation process SSIS checks that the source and destination locations are accessible and that the metadata about the source and destination tables stored with the package is correct, so that the tasks will not fail when executed. The validation process reports warnings and errors depending on the validation failure detected. For example, if the source/destination tables/columns have been changed or dropped it will be shown as an error, whereas if you are pulling more columns than are used to write to the destination object this will be flagged as a warning. To learn more about validation click here.

Define design time validation versus run time validation. Design time validation is performed when you are opening your package in BIDS whereas run time validation is performed when you are actually executing the package.

Define early validation (package level validation) versus late validation (component level validation). When a package is executed, the package goes through the validation process. All of the components/tasks of the package are validated before the package execution actually starts. This is called early validation or package level validation. During execution of a package, SSIS validates each component/task again just before executing that particular component/task. This is called late validation or component level validation.

What is DelayValidation and what is its significance? As I said before, during early validation all of the components of the package are validated along with the package itself. If any of the components/tasks fails to validate, SSIS will not start the package execution. In most cases this is fine, but what if the second task is dependent on the first task? For example, say you are creating a table in the first task and referring to the same table in the second task. When early validation starts, it will not be able to validate the second task because the dependent table has not been created yet. Keep in mind that early validation is performed before the package execution starts. So what should we do in this case? How can we ensure the package is executed successfully and that the logical flow of the package is correct? This is where you can use the DelayValidation property. In the above scenario you should set the DelayValidation property of the second task to TRUE, in which case early validation, i.e. package level validation, is skipped for that task and the task is only validated during late validation, i.e. component level validation. Please note, using the DelayValidation property you can only skip early validation for that specific task; there is no way to skip late or component level validation.
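A minimal sketch of the scenario above using the object model; the tasks are empty shells and only the property assignment matters here.

using Microsoft.SqlServer.Dts.Runtime;

class DelayValidationDemo
{
    static void Main()
    {
        Package package = new Package();

        // Task 1 creates a work table, task 2 loads it (statements omitted for brevity)
        TaskHost createTable = (TaskHost)package.Executables.Add("STOCK:SQLTask");
        TaskHost loadTable   = (TaskHost)package.Executables.Add("STOCK:SQLTask");
        package.PrecedenceConstraints.Add(createTable, loadTable);

        // Skip early (package level) validation for the dependent task;
        // it will only be validated just before it executes (late validation)
        loadTable.DelayValidation = true;
    }
}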

Problem

When preparing for an SSIS interview it is important to understand what questions could be asked in the interview. In this tip series (tip 1, tip 2, tip 3), I will try to cover common SSIS interview questions to help you prepare for a future SSIS interview. In this tip we will cover interview questions related to the SSIS architecture and internals. Check it out.

Solution
What are the different components in the SSIS architecture? The SSIS architecture comprises four main components:
o The SSIS runtime engine manages the workflow of the package
o The data flow pipeline engine manages the flow of data from source to destination and the in-memory transformations
o The SSIS object model is used for programmatically creating, managing and monitoring SSIS packages
o The SSIS Windows service allows managing and monitoring packages
To learn more about the architecture click here.
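As a small illustration of the object model component, loading and executing an existing package from managed code might look like the following sketch (the file path is hypothetical).

using Microsoft.SqlServer.Dts.Runtime;

class RunPackageDemo
{
    static void Main()
    {
        Application app = new Application();

        // Load a package from the file system (it could also come from MSDB or the package store)
        Package package = app.LoadPackage(@"C:\SSIS\LoadCustomers.dtsx", null);

        DTSExecResult result = package.Execute();

        if (result == DTSExecResult.Failure)
        {
            foreach (DtsError error in package.Errors)
            {
                System.Console.WriteLine(error.Description);
            }
        }
    }
}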

How is the SSIS runtime engine different from the SSIS dataflow pipeline engine? The SSIS runtime engine manages the workflow of the package during runtime, which means its role is to execute the tasks in the defined sequence. As you know, you can define the sequence using precedence constraints. This engine is also responsible for providing support for event logging, breakpoints in the BIDS designer, package configurations, transactions and connections. The SSIS runtime engine has been designed to support concurrent/parallel execution of the tasks in a package. The dataflow pipeline engine is responsible for executing the data flow tasks of the package. It creates a dataflow pipeline by allocating in-memory structures for storing data in transit. This means the engine pulls data from the source, stores it in memory, executes the required transformations on the data stored in memory and finally loads the data to the destination. Like the SSIS runtime engine, the dataflow pipeline engine has been designed to do its work in parallel by creating multiple threads and enabling them to run multiple execution trees/units in parallel.

How is a synchronous (non-blocking) transformation different from an asynchronous (blocking) transformation in SQL Server Integration Services? A transformation changes the data into the required format before loading it to the destination or passing it down the path. Transformations can be categorized as synchronous or asynchronous. A transformation is called synchronous when it processes each incoming row (modifying the data in place so that the layout of the result set remains the same) and passes it down the hierarchy/path. This means output rows are synchronous with the input rows (a 1:1 relationship between input and output rows) and hence the transformation uses the same allocated buffer set/memory and does not require additional memory. Please note, these kinds of transformations have lower memory requirements as they work on a row-by-row basis (and hence run quite fast) and do not block the data flow in the pipeline. Some examples are: Lookup, Derived Column, Data Conversion, Copy Column, Multicast and Row Count transformations.

A transformation is called asynchronous when it requires all incoming rows to be stored locally in memory before it can start producing output rows. For example, the Aggregate Transformation requires all the rows to be loaded and stored in memory before it can aggregate and produce the output rows. Here you can see the input rows are not in sync with the output rows and more memory is required to store the whole set of data (no memory reuse) for both the data input and output. These kinds of transformations have higher memory requirements (and there is a high chance of buffers spooling to disk if insufficient memory is available) and generally run slower. Asynchronous transformations are also called "blocking transformations" because they block the output rows until all input rows have been read into memory. To learn more about it click here.

What is the difference between a partially blocking transformation and a fully blocking transformation in SQL Server Integration Services? Asynchronous transformations, as discussed in the last question, can be further divided into two categories depending on their blocking behavior:
o Partially Blocking Transformations do not block the output until a full read of the inputs occurs. However, they require new buffers/memory to be allocated to store the newly created result set because the output from these kinds of transformations differs from the input set. For example, the Merge Join transformation joins two sorted inputs and produces a merged output. In this case, the data flow pipeline engine creates two sets of input buffers, but the merged output from the transformation requires another set of output buffers because the structure of the output rows is different from the input rows. This means the memory requirement for this type of transformation is higher than for synchronous transformations, where the transformation is completed in place.
o Fully Blocking Transformations, apart from requiring an additional set of output buffers, also block the output completely until the whole input set is read. For example, the Sort Transformation requires all input rows to be available before it can start sorting and pass the rows down the output path. These kinds of transformations are the most expensive and should be used only when needed. For example, if you can get sorted data from the source system, use that instead of using a Sort transformation to sort the data in transit/memory. To learn more about it click here.

What is an SSIS execution tree and how can I analyze the execution trees of a data flow task? The work to be done in a data flow task is divided by the dataflow pipeline engine into multiple chunks, called execution units, each representing a group of transformations. An individual execution unit is called an execution tree, which can be executed by a separate thread along with other execution trees in parallel. The memory structures, called data buffers, are created by the data flow pipeline engine and have the scope of an individual execution tree. An execution tree normally starts at either the source or an asynchronous transformation and ends at the first asynchronous transformation or a destination. During execution of the execution tree, the source reads the data, stores it in a buffer, executes the transformations in the buffer and passes the buffer to the next execution tree in the path by passing pointers to the buffers. To learn more about it click here.

To see how many execution trees are created and how many rows are stored in each buffer for an individual data flow task, you can enable logging of these data flow task events: PipelineExecutionTrees, PipelineComponentTime, PipelineInitialization, BufferSizeTuning, etc. To learn more about the events that can be logged click here.

How can an SSIS package be scheduled to execute at a defined time or at a defined interval per day? You can configure a SQL Server Agent job with a job step type of SQL Server Integration Services Package; the job invokes the dtexec command line utility internally to execute the package. You can run the job (and in turn the SSIS package) on demand, or you can create a schedule for a one time need or on a recurring basis. Refer to this tip to learn more about it.

What is an SSIS Proxy account and why would you create it? When we try to execute an SSIS package from a SQL Server Agent job it fails with the message "Non-SysAdmins have been denied permission to run DTS Execution job steps without a proxy account". This error message is generated if the account under which the SQL Server Agent Service is running and the job owner is not a sysadmin on the instance, or if the job step is not set to run under a proxy account associated with the SSIS subsystem. Refer to this tip to learn more about it.

How can you configure your SSIS package to run in 32-bit mode on a 64-bit machine when using data providers which are not available on the 64-bit platform? In order to run an SSIS package in 32-bit mode the SSIS project property Run64BitRuntime needs to be set to "False". The default configuration for this property is "True". This configuration is an instruction to load the 32-bit runtime environment rather than the 64-bit one, and your packages will still run without any additional changes. The property can be found under SSIS Project Property Pages -> Configuration Properties -> Debugging.

Problem
SQL Server Integration Services (SSIS) has grown a lot from its predecessor DTS (Data Transformation Services) to become an enterprise wide ETL (Extraction, Transformation and Loading) product in terms of its usability, performance, parallelism etc. Apart from being an ETL product, it also provides different built-in tasks to manage a SQL Server instance. Although the internal architecture of SSIS has been designed to provide a high degree of performance and parallelism, there are still some best practices to further optimize performance. In this tip series, I will be talking about best practices to consider while working with SSIS which I have learned while working with SSIS for the past couple of years.

Solution
As mentioned above, SSIS is the successor of DTS (of SQL Server 7/2000). If you are coming from a DTS background, SSIS packages may look similar to DTS packages, but that is not the case in reality. What I mean is, SSIS is not an enhancement to DTS but rather a new product which has been written from scratch to provide high performance and parallelism, and as a result of this it overcomes several limitations of DTS.

SSIS 2008 further enhanced the internal dataflow pipeline engine to provide even better performance; you might have heard that SSIS 2008 set an ETL world record by loading 1TB of data in less than half an hour. The best part of SSIS is that it is a component of SQL Server. It comes free with the SQL Server installation and you don't need a separate license for it. Because of this, along with hardcore BI developers, database developers and database administrators are also using it to transfer and transform data.

Best Practice #1 - Pulling High Volumes of Data
Recently we had to pull data from a source table which had 300 million records to a new target table. Initially when the SSIS package started everything looked fine, data was being transferred as expected, but gradually the performance degraded and the data transfer rate went down dramatically. During analysis we found that the target table had a clustered primary key and two non-clustered indexes. Because of the high volume of data inserted into the target table these indexes became heavily fragmented, up to 85%-90%. We used the online index rebuilding feature to rebuild/defrag the indexes, but the fragmentation level was back to 90% after every 15-20 minutes during the load. This whole process of data transfer and parallel online index rebuilds took almost 12-13 hours, which was much more than our expected time for the data transfer. Then we came up with an approach: make the target table a heap by dropping all the indexes on the target table at the beginning, transfer the data to the heap and, on data transfer completion, recreate the indexes on the target table. With this approach, the whole process (dropping indexes, transferring data and recreating indexes) took just 3-4 hours, which was what we were expecting. This whole process is shown graphically in the flow chart below. So the recommendation is to consider dropping your target table indexes, if possible, before inserting data into it, especially if the volume of inserts is very high.

Best Practice #2 - Avoid SELECT *
The Data Flow Task (DFT) of SSIS uses a buffer (a chunk of memory) oriented architecture for data transfer and transformation. When data travels from the source to the destination, the data first comes into the buffer, the required transformations are done in the buffer itself and then the data is written to the destination. The size of the buffer depends on several factors, one of them being the estimated row size. The estimated row size is determined by summing the maximum size of all the columns in the row. So more columns in a row means fewer rows in a buffer, more buffers are needed and the result is performance degradation. Hence it is recommended to select only those columns which are required at the destination. Even if you need all the columns from the source, you should list the column names explicitly in the SELECT statement, otherwise it takes another round trip for the source to gather meta-data about the columns when you use SELECT *. If you pull columns which are not required at the destination (or for which no mapping exists) SSIS will emit warnings like this.

[SSIS.Pipeline] Warning: The output column "SalariedFlag" (64) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
[SSIS.Pipeline] Warning: The output column "CurrentFlag" (73) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

Beware when you are using the "Table or view" or "Table name or view name from variable" data access mode in the OLEDB source. It behaves like SELECT * and pulls all the columns; use this access mode only if you need all the columns of the table or view at the destination. Tip: try to fit as many rows into the buffer as possible, which will eventually reduce the number of buffers passing through the dataflow pipeline engine and improve performance.

Best Practice #3 - Effect of OLEDB Destination Settings
There are a couple of settings on the OLEDB destination which can impact the performance of the data transfer, as listed below.
Data Access Mode - This setting provides the 'fast load' option, which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statement (one for each row) as is the case with the other options. So unless you have a reason to change it, don't change this default value of fast load. If you select the 'fast load' option, there are also a couple of other settings which you can use, as discussed below.
Keep Identity - By default this setting is unchecked, which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.
Keep Nulls - Again, by default this setting is unchecked, which means a default value will be inserted (if a default constraint is defined on the target column) during the insert into the destination table when a NULL value is coming from the source for that particular column. If you check this option, the default constraint on the destination table's column will be ignored and the NULL from the source column will be preserved and inserted into the destination.
Table Lock - By default this setting is checked and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table lock will be acquired on the destination table instead of acquiring multiple row level locks, which could turn into lock escalation problems.
Check Constraints - Again, by default this setting is checked and the recommendation is to uncheck it if you are sure the incoming data is not going to violate the constraints of the destination table. This setting specifies whether the dataflow pipeline engine will validate the incoming data against the constraints of the target table. Unchecking this option will improve the performance of the data load.

Best Practice #4 - Effect of Rows Per Batch and Maximum Insert Commit Size Settings
Rows per batch - The default value for this setting is -1, which means all incoming rows will be treated as a single batch. You can change this default behavior and break all incoming rows into multiple batches. The allowed value is only a positive integer which specifies the maximum number of rows in a batch.
Maximum insert commit size - The default value for this setting is 2147483647 (the largest value for a 4 byte integer type), which means all incoming rows will be committed once on successful completion. You can specify a positive value for this setting to indicate that a commit will be done for that number of records. You might be wondering whether changing the default value for this setting will put overhead on the dataflow engine to commit several times.
Yes, that is true, but at the same time it will relieve the pressure on the transaction log and tempdb to grow tremendously, specifically during high volume data transfers. The above two settings are very important to understand in order to improve the performance of tempdb and the transaction log. For example, if you leave 'Max insert commit size' at its default, the transaction log and tempdb will keep on growing during the extraction process, and if you are transferring a high volume of data tempdb will soon run out of space, and as a result your extraction will fail. So it is recommended to set these values to an optimum value based on your environment.

Note: The above recommendations have been made on the basis of experience gained working with DTS and SSIS over the last couple of years. But as noted before there are other factors which impact performance, one of them being infrastructure and network. So you should do thorough testing before putting these changes into your production environment.

Problem
In the first tip (SQL Server Integration Services (SSIS) - Best Practices - Part 1) of this series I wrote about SSIS design best practices. To continue down that path, this tip is going to cover recommendations related to the SQL Server Destination Adapter, asynchronous transformations, DefaultBufferMaxSize and DefaultBufferMaxRows, BufferTempStoragePath and BLOBTempStoragePath as well as the DelayValidation property.

Solution
In this tip my recommendations are related to different kinds of transformations, impacts on overall SSIS package performance, how memory is managed in SSIS by creating buffers, working with insufficient memory, how SSIS manages spooling when experiencing memory pressure and the significance of the DelayValidation property.

Best Practice #5 - SQL Server Destination Adapter
It is recommended to use the SQL Server Destination adapter if your target is a local SQL Server database. It provides a similar level of data insertion performance as the Bulk Insert task and provides some additional benefits. With the SQL Server Destination adapter you can transform the data before uploading it to the destination, which is not possible with the Bulk Insert task. Apart from the options which are available with the OLEDB destination adapter, you get some more options with the SQL Server destination adapter, as depicted in the images below. For example, you can specify whether the insert triggers on the target table should fire or not. By default this option is set to false, which means no triggers on the destination table will fire. Enabling this option may cause an additional performance hit because the triggers need to fire, but the trigger logic may be needed to enforce data or business rules. Additional options include specifying the number of first/last rows in the input to load, specifying the maximum number of errors which will cause the bulk load operation to be cancelled, and specifying the insert column sort order which will be used during the upload process.

Remember, if your SQL Server database is on a remote server you cannot use the SQL Server Destination adapter; instead use the OLEDB destination adapter. In addition, if it is likely that the destination will change from a local to a remote instance or from one SQL Server instance to another, it is better to use the OLEDB destination adapter to minimize future changes.

Best Practice #6 - Avoid asynchronous transformations (such as the Sort Transformation) wherever possible

Before I talk about the different kinds of transformations and their impact on performance, let me briefly talk about how SSIS works internally. The SSIS runtime engine executes the package. It executes every task other than the data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands over the execution of that data flow task to the data flow pipeline engine. The data flow pipeline engine breaks the execution of a data flow task into one or more execution trees and may execute two or more execution trees in parallel to achieve high performance. Now if you are wondering what an execution tree is, here is the answer. An execution tree, as the name implies, has a structure similar to a tree. It starts at a source or an asynchronous transformation and ends at the destination or the first asynchronous transformation in the hierarchy. Each execution tree has a set of allocated buffers and the scope of these buffers is associated with the execution tree. Each execution tree is also allocated an OS thread (worker thread) and, unlike buffers, this thread may be shared with other execution trees; in other words, an OS thread might execute one or more execution trees. Click here for more details on execution trees. In SSIS 2008, the process of breaking a data flow task into execution trees has been enhanced to create execution paths and sub-paths so that your package can take advantage of high-end multi-processor systems. Click here for more details on SSIS 2008 pipeline enhancements.

Synchronous transformations get a record, process it and pass it to the next transformation or destination in the sequence. The processing of a record is not dependent on the other incoming rows. Because synchronous transformations output the same number of records as the input, they do not require new buffers (processing is done in the same incoming buffers, i.e. in the same allocated memory) and because of this they are normally faster. For example, the Derived Column transformation adds a new column to each incoming row, but it does not add any additional records to the output. Unlike synchronous transformations, an asynchronous transformation might output a different number of records than the input, requiring new buffers to be created. Because an output row is dependent on one or more input records it is called a blocking transformation. Depending on the type of blocking it can be either a partially blocking or a fully blocking transformation. For example, the Sort Transformation is a fully blocking transformation as it requires all the incoming rows to arrive before processing. As discussed above, an asynchronous transformation requires additional buffers for its output and does not utilize the incoming input buffers. It also waits for all incoming rows to arrive for processing; that's the reason asynchronous transformations perform slower and must be avoided wherever possible. For example, instead of using the Sort Transformation you can get sorted results from the source itself by using an ORDER BY clause.

Best Practice #7 - DefaultBufferMaxSize and DefaultBufferMaxRows
As I said in Best Practice #6, the execution tree creates buffers for storing incoming rows and performing transformations. So how many buffers does it create? How many rows fit into a single buffer? How does it impact performance? The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors.
The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns from the incoming records. The second consideration is the DefaultBufferMaxSize property of the data flow task. This property specifies the default maximum size of a buffer. The default value is 10 MB and its upper and lower boundaries are constrained by two internal properties of SSIS, MaxBufferSize (100 MB) and MinBufferSize (64 KB). This means the size of a buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, which is again a property of the data flow task and specifies the default number of rows in a buffer. Its default value is 10000. Although SSIS does a good job of tuning these properties in order to create an optimum number of buffers, if the estimated size exceeds DefaultBufferMaxSize then it reduces the number of rows in the buffer. For better buffer performance you can do two things. First, you can remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file. This will enable you to accommodate as many rows as possible in the buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which could improve performance. Beware: if you change the values of these properties to a point where page spooling (see Best Practice #8) begins, it will adversely impact performance. So before you set a value for these properties, first test thoroughly in your environment and set the values appropriately. You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains and you can monitor the "Buffers spooled" performance counter to see if SSIS has begun page spooling. I will talk more about event logging and performance counters in the next tips of this series.

Best Practice #8 - BufferTempStoragePath and BLOBTempStoragePath
If there is a lack of memory resources, i.e. Windows triggers a low memory notification event, memory overflow or memory pressure, the incoming records, except BLOBs, will be spooled to the file system by SSIS. The file system location is set by the BufferTempStoragePath property of the data flow task. By default its value is blank, in which case the location will be based on the value of the TEMP/TMP system variable. Likewise SSIS may choose to write BLOB data to the file system before sending it to the destination, because BLOB data is typically large and cannot be stored in the SSIS buffer. Once again the file system location for spooling BLOB data is set by the BLOBTempStoragePath property of the data flow task. By default its value is blank, in which case the location will be the value of the TEMP/TMP system variable. As I said, if you don't specify values for these properties, the values of the TEMP and TMP system variables will be used as the locations for spooling. The same information is recorded in the log if you enable logging of the PipelineInitialization event of the data flow task, as shown below.

User:PipelineInitialization,ARSHADALI-LAP,FAREAST\arali,Data Flow Task,{C80814F8-51A4-4149-8141-D840C9A81EE7},{D1496B27-9FC7-4760-821E80285C33E74D},10/11/2009 1:38:10 AM,10/11/2009 1:38:10 AM,0,0x,No temporary BLOB data storage locations were provided. The buffer manager will consider the directories in the TEMP and TMP environment variables.

So far so good. What is important here is to change the default values of the BufferTempStoragePath/BLOBTempStoragePath properties and specify locations to which the user executing the package (if the package is being executed by a SQL Server Agent job, then the SQL Server Agent service account) has access. Preferably both locations should refer to separate fast drives (with separate spindles) to maximize I/O throughput and improve performance.

Best Practice #9 - How the DelayValidation property can help you
SSIS uses validation to determine if the package could fail at runtime.
SSIS uses two types of validation. The first is package validation (early validation), which validates the package and all of its components before starting execution of the package. The second is component validation (late validation), which validates the components of the package once execution has started. Let's consider a scenario where the first component of the package creates an object, i.e. a temporary table, which is referenced by the second component of the package. During package validation, the first component has not yet executed, so no object has been created, causing a package validation failure when validating the second component. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario? To help you in this scenario, every component has a DelayValidation (default=FALSE) property. If you set it to TRUE, early validation will be skipped and the component will be validated only at the component level (late validation), which happens during package execution.

Problem
In the previous tips (SQL Server Integration Services (SSIS) - Best Practices - Part 1 and Part 2) of this series I briefly talked about SSIS and a few of the best practices to consider when designing SSIS packages. Continuing in the same rhythm, I am going to discuss some more best practices for SSIS package design, how you can design high performing packages with parallelism, troubleshooting performance problems etc.

Solution
In this tip my recommendations are around how you can achieve high performance by achieving a higher degree of parallelism, how you can identify the cause of poorly performing packages, how distributed transactions work within SSIS and finally what you can do to restart a package execution from the last point of failure. As you can see this tip starts at best practice #10. See these other two tips (Part 1 and Part 2) for best practices 1-9.

Best Practice #10 - Better performance with parallel execution
SSIS has been designed to achieve high performance by running the executables of the package and the data flow tasks in parallel. This parallel execution of the SSIS package executables and data flow tasks can be controlled by two properties provided by SSIS, as discussed below.
MaxConcurrentExecutables - This is a property of the SSIS package and specifies the number of executables (the different tasks inside the package) that can run in parallel within a package, or in other words the number of threads the SSIS runtime engine can create to execute the executables of the package in parallel. As I discussed in Best Practice #6, the SSIS runtime engine executes the package and every task defined in it (except data flow tasks) in the defined workflow. So as long as you have a sequential workflow in the package (one task after another, with precedence defined between tasks using precedence constraints) this property will not make any difference. But if your package workflow has parallel tasks, this property will make a difference. Its default value is -1, which means the total number of available processors + 2; if you have hyper-threading enabled then it is the total number of logical processors + 2.
EngineThreads - As I said, MaxConcurrentExecutables is a property of the package and is used by the SSIS runtime engine for parallel execution of package executables; likewise data flow tasks have the EngineThreads property, which is used by the data flow pipeline engine and has a default value of 5 in SSIS 2005 and 10 in SSIS 2008. This property specifies the number of source threads (which pull data from the source) and worker threads (which perform the transformations and upload the data into the destination) that can be created by the data flow pipeline engine to manage the flow of data and the data transformations inside a data flow task. So if EngineThreads has a value of 5, then up to 5 source threads and also up to 5 worker threads can be created. Please note, this property is just a suggestion to the data flow pipeline engine; the pipeline engine may create fewer or more threads if required.
Example - Let's consider you have a package with 5 data flow tasks in parallel and the MaxConcurrentExecutables property has a value of 3. When you start the package execution, the runtime will start executing 3 data flow tasks of the package in parallel; the moment any of the executing data flow tasks completes, the execution of the next waiting data flow task starts, and so on. What happens inside the data flow task is controlled by the EngineThreads property. As I described in Best Practice #6, the work of a data flow task is broken down into one or more execution trees (you can see how many execution trees are created for a data flow task by turning on logging for the PipelineExecutionTrees event of the data flow task); the data flow pipeline engine might create as many source and worker threads as the value you have set for EngineThreads to execute one or more execution trees in parallel. Here also, if you have set the EngineThreads property to 5 and your data flow task is broken down into 5 execution trees, it does not mean all execution trees will run in parallel; to summarize, sequence matters here as well. Be very careful while changing these properties and do thorough testing before the final implementation. If properly configured within the constraints of available system resources, these properties improve performance by achieving parallelism; on the other hand, if they are poorly configured they will hurt performance because of too much context switching from one thread to another. So as a general rule of thumb do not create and run more threads in parallel than the number of available processors.
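A hedged sketch of setting both properties from the object model follows; the package path and task name are hypothetical, and reaching EngineThreads through the task host's property collection is an assumption.

using Microsoft.SqlServer.Dts.Runtime;

class ParallelismDemo
{
    static void Main()
    {
        Application app = new Application();
        Package package = app.LoadPackage(@"C:\SSIS\LoadCustomers.dtsx", null);   // hypothetical path

        // Let the runtime engine start up to 4 executables in parallel (-1 = processors + 2)
        package.MaxConcurrentExecutables = 4;

        // Suggest up to 8 source/worker threads to the pipeline engine of one data flow task
        TaskHost dataFlow = (TaskHost)package.Executables["Data Flow Task"];   // hypothetical task name
        dataFlow.Properties["EngineThreads"].SetValue(dataFlow, 8);

        app.SaveToXml(@"C:\SSIS\LoadCustomers.dtsx", package, null);
    }
}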

Best Practice #11 - When to use event logging and when to avoid it
Logging (or tracing the execution) is a great way of diagnosing problems that occur during runtime; it helps a lot when your code does not work as expected. Nowadays, almost every programming language provides a logging mechanism to identify the root cause of unexpected problems or runtime failures, and SSIS is not an exception. SSIS allows you to enable logging, a powerful, flexible and extremely helpful feature, for your packages and their executables. Not only that, it also allows you to choose different events of a package and components of the package to log, as well as the location where the log information is to be written (text files, SQL Server, SQL Server Profiler, Windows Event log etc.). By now you should be convinced that logging saves you from the hours of frustration you might otherwise spend finding the causes of a problem, but the story doesn't end here. It's true that it helps you identify the problem and its root causes, but at the same time it is an overhead for SSIS and ultimately hits performance as well, especially if you use logging excessively. So the recommendation here is to enable logging only if required; you can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package. Also you should choose to log only for those executables which you suspect have problems, and further you should only log those events which are absolutely required for troubleshooting. Read more about logging

Best Practice #12 - Monitoring SSIS Performance with Performance Counters
In Best Practice #11, I discussed how you can turn on event logging for your package and its components and analyze performance related problems. Apart from that, SSIS also introduced (this was not available with DTS) system performance counters to monitor the performance of the SSIS runtime and data flow pipeline engines. For example, the SSIS Package Instance counter indicates the number of SSIS packages running on the system; the Rows read and Rows written counters indicate the total number of rows coming from the source and the total number of rows provided to the destination; the Buffers in use and Buffer memory counters indicate the total number of buffers created and the amount of memory used by them; Buffers spooled is a very important counter and tells you the number of buffers (which are not currently in use) written to disk when physical memory runs low; the BLOB bytes read, BLOB bytes written and BLOB files in use counters give details about BLOB data transfer and tell you the number of BLOB bytes read and written and the total number of files that the data flow engine is currently using for spooling BLOB data. An exhaustive list of all the SSIS performance counters can be found here. If you upgrade Windows Server 2003, on which you have installed SQL Server and SSIS, to Windows Server 2008, the SSIS performance counters will disappear; this happens because the upgrade process removes the SSIS performance counters from the system. To restore them, refer to this KB article.

Best Practice #13 - Distributed Transactions in SSIS and their impact
SSIS allows you to group two or more executables to execute within a single transaction by using a distributed transaction (however, you need to start the Distributed Transaction Coordinator Windows service). Though at first glance it sounds cool, it might cause blocking issues, especially if you have a task in the middle which takes a long time to execute. For example, let's consider a scenario (a contrived example, but it explains the point): you have a data flow task, a Web service task and then another data flow task in sequence. The first data flow task pulls data from the source into staging and completes in minutes, the web service task pulls data from a web service into staging and takes hours to complete, and the last data flow task merges these data and uploads them into a final table. Now if you execute all three of these tasks in a single transaction, the resources locked by the first data flow task will remain blocked until the very end, even though they are not being used anymore while the web service task is executing. So the recommendation here is: even though SSIS provides distributed transaction support for a group of executables, you should use it only when it's absolutely necessary. Even if you are using it, try to keep the long-running tasks outside the group and set the IsolationLevel property for the group transaction prudently. Wherever possible you should avoid it and use some other alternative which suits your particular scenario.

Best Practice #14 - How the Checkpoint feature helps in package restarting

SSIS has a very useful feature called Checkpoint, which allows your package to restart from the point of failure on the next execution. By enabling this feature you can skip the tasks that executed successfully and start the package from the task which failed in the last run. You enable it for a package by setting three package properties: CheckpointFileName, CheckpointUsage and SaveCheckpoints. Apart from that, you need to set the FailPackageOnFailure property to TRUE for every task you want to be considered in a restart; only when this property is set does a failure of that task fail the package, get captured in the checkpoint file and cause the next execution to start from that task. So how does it work? When you enable checkpoints for a package, the execution status is written to the checkpoint file (its name and location are specified with the CheckpointFileName property). On subsequent executions, the runtime engine refers to the checkpoint file to see the last execution status before starting the package; if it finds a failure in the last execution, it knows where that execution failed and starts from that task. If you delete this file before the subsequent execution, the package starts from the beginning even though the last execution failed, as it has no way to know this. By enabling this feature you can save a lot of time during subsequent executions (data pulls or transformations on large data volumes take a long time) by skipping the tasks which succeeded in the last run and starting from the task which failed. One very important point to note: you can enable a task, including a data flow task, to participate in a checkpoint, but checkpoints do not apply inside a data flow task. In other words, you can enable it only at the data flow task level; you cannot set a checkpoint inside the data flow, for example at the level of an individual transformation. Consider a scenario: you have a data flow task with FailPackageOnFailure set to TRUE so it participates in checkpoints. Inside the data flow task you have five transformations in sequence and execution fails at the 5th (the first 4 completed successfully). On the next execution, execution starts at the data flow task and the first 4 transformations run again before reaching the 5th. Read more about checkpoints
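The same three package properties and the per-task FailPackageOnFailure flag can also be set programmatically; this is a minimal sketch using the runtime API, with the file paths and the task name "DFT Load Fact Table" as placeholders.

```csharp
using Microsoft.SqlServer.Dts.Runtime;

class EnableCheckpointsSample
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\SSIS\LoadDW.dtsx", null);   // placeholder path

        // The three package-level properties that switch the checkpoint feature on.
        pkg.CheckpointFileName = @"C:\SSIS\LoadDW.chk";
        pkg.CheckpointUsage = DTSCheckpointUsage.IfExists;   // restart from the file when it exists
        pkg.SaveCheckpoints = true;

        // Every task that should act as a restart point must fail the package on its own failure.
        TaskHost dft = pkg.Executables["DFT Load Fact Table"] as TaskHost;   // hypothetical task name
        if (dft != null)
        {
            dft.FailPackageOnFailure = true;
        }

        app.SaveToXml(@"C:\SSIS\LoadDW.dtsx", pkg, null);
    }
}
```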

Note These recommendations are based on the experience gained from working with DTS and SSIS over the last couple of years. But, as noted before, there are other factors which impact performance, one of them being the infrastructure and network. So you must do thorough testing before putting these changes into your production environment. Problem In the previous tips (SQL Server Integration Services (SSIS) - Best Practices - Part 1, Part 2 and Part 3) of this series I briefly talked about SSIS and a few of the best practices to consider while designing SSIS packages. Continuing on this path, I am going to discuss some more best practices for SSIS package design: how you can use lookup transformations and what considerations you need to take, the impact of implicit typecasts in SSIS, the changes in the SSIS 2008 internal system tables and stored procedures, and finally some general guidelines.

Solution In this tip my recommendations are around the lookup transformation and the different considerations you need to take into account while using it, the changes in the SSIS 2008 system tables and stored procedures, the impact of implicit typecasts and finally some general guidelines at the end.

As you can see, this tip starts at best practice #15. See the other tips (Part 1, Part 2 and Part 3) for best practices 1-14.

Best Practice #15 - Did you know, SSIS is case-sensitive? In one of my projects, we added a new column to a source table and wanted it transferred to a destination table as well. We made the required changes in our SSIS package to pull the data for this additional column, but when we started testing we noticed our SSIS package was failing with the following errors:

[OLE DB Destination [16]] Warning: The external columns for component "OLE DB Destination" (16) are out of synchronization with the data source columns. The column "EmployeeId" needs to be added to the external columns. The external column "EmployeeID" (82) needs to be removed from the external columns.

[SSIS.Pipeline] Error: "component "OLE DB Destination" (16)" failed validation and returned validation status "VS_NEEDSNEWMETADATA".

After spending several frustrating hours investigating the problem, we realized that even though our SQL Server instance is case-insensitive, the SSIS package is case-sensitive. The reason for the failure was that we had altered one column of the table from EmployeeId to EmployeeID, and since the SSIS package had stored the source and destination column mappings with the old definition, it started failing because of this change in the case of the column name. So whenever you get this kind of error, compare the case of your source/destination columns with the mappings stored in the SSIS package by going to the mapping page of the OLE DB destination adapter of the Data Flow Task.

Best Practice #16 - Lookup transformation considerations In the data warehousing world, it's a frequent requirement to fetch records from a source by matching them against a lookup table. To perform this kind of transformation, SSIS provides a built-in Lookup transformation. The Lookup transformation has been designed to perform optimally; for example, by default it uses Full Caching mode, in which the entire reference dataset is brought into memory at the beginning (during the pre-execute phase of the package) and kept there for reference. This ensures the lookup operation performs fast and at the same time reduces the load on the reference table, as SSIS does not have to fetch each individual record one by one when it is required. Though this sounds great, there are some gotchas. First, you need enough physical memory to hold the complete reference dataset; if SSIS runs out of memory it does not swap the data to the file system, and the data flow task therefore fails. This mode is recommended if you have enough memory to hold the reference dataset and your reference data does not change frequently; in other words, changes to the reference table will not be reflected once the data has been fetched into memory. If you do not have enough memory, or the data does change frequently, you can use either Partial Caching mode or No Caching mode. In Partial Caching mode, whenever a record is required it is pulled from the reference table and kept in memory; you can also specify the maximum amount of memory to be used for caching, and if that limit is crossed the least recently used records are removed from memory to make room for new ones. This mode is recommended when you have memory constraints and your reference data does not change frequently. No Caching mode performs slower, as every time a record is needed it is pulled from the reference table and nothing is cached except the last row. It is recommended if you have a large reference dataset that does not fit in memory, or if your reference data changes frequently and you always want the latest data. More details about how the Lookup transformation works can be found here, and the lookup enhancements in SSIS 2008 can be found here. To summarize the recommendations for the Lookup transformation:
o Choose the caching mode wisely after analyzing your environment and doing thorough testing.
o If you are using Partial Caching or No Caching mode, ensure you have an index on the reference table for better performance.
o Instead of directly specifying a reference table in the lookup configuration, use a SELECT statement with only the required columns.
o Use a WHERE clause to filter out all the rows which are not required for the lookup.
o In SSIS 2008, you can save your cache to be shared by different lookup transformations, data flow tasks and packages; utilize this feature wherever applicable.

Best Practice #17 - Names of system tables and procedures have changed between SSIS 2005 and SSIS 2008 SSIS gives you different location choices for storing your SSIS packages; for example, you can store them in the file system, in SQL Server, etc. When you store a package on SQL Server, it is stored in system tables in the msdb database. You can write your own code to upload/download packages from these system tables, or use undocumented system stored procedures for these tasks.

Now the twist in the story is that, since SSIS 2005 grew out of DTS, its system tables and system stored procedures use "dts" in their names, as you can see in the first column of the table below. With SSIS 2008, the SSIS team standardized the naming convention and uses "ssis" in the names instead, as you can see in the second column of the table below. So if you are using these system tables or system stored procedures in your code and are upgrading to SSIS 2008, your code will break unless you change it to accommodate the new naming convention.

SSIS 2005                       SSIS 2008

System Tables:                  System Tables:
sysdtspackagefolders90          sysssispackagefolders
sysdtspackages90                sysssispackages

System Procedures:              System Procedures:
sp_dts_addfolder                sp_ssis_addfolder
sp_dts_putpackage               sp_ssis_putpackage
sp_dts_deletefolder             sp_ssis_deletefolder
sp_dts_deletepackage            sp_ssis_deletepackage
sp_dts_getfolder                sp_ssis_getfolder
sp_dts_getpackage               sp_ssis_getpackage
sp_dts_listfolders              sp_ssis_listfolders
sp_dts_listpackages             sp_ssis_listpackages
sp_dts_renamefolder             sp_ssis_renamefolder
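For example, a small piece of code that lists the packages stored on the server might query the msdb tables directly; a sketch along these lines (the connection string is a placeholder) would need its table name changed from sysdtspackages90 to sysssispackages when moving from SSIS 2005 to SSIS 2008.

```csharp
using System;
using System.Data.SqlClient;

class ListStoredPackages
{
    static void Main()
    {
        // Placeholder connection string; point it at the msdb database of your instance.
        string connectionString = "Data Source=.;Initial Catalog=msdb;Integrated Security=SSPI";

        // SSIS 2008 table name; on SSIS 2005 use dbo.sysdtspackages90 instead.
        string sql = "SELECT [name] FROM dbo.sysssispackages ORDER BY [name]";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}
```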

Best Practice #18 - Be aware of implicit typecasts When you use the Flat File Connection Manager, it treats all columns as the string [DT_STR] data type. You should convert all numeric data to the appropriate data type, or it will slow down performance. Wondering how? SSIS uses a buffer oriented architecture (refer to Best Practices #6 and #7 for more details): it pulls data from the source into buffers, does the transformations in the buffers and passes them on to the destinations. So the more rows SSIS can accommodate in a single buffer, the better the performance will be. By leaving all the columns as the string data type you force SSIS to reserve more space in the buffer than the numeric data types would need, and hence performance degrades. Tip: Try to fit as many rows as you can into the buffer, which will eventually reduce the number of buffers passing through the SSIS data flow pipeline engine and improve overall performance.

Best Practice #19 - Finally some more general SSIS tips
o The Merge and Merge Join components require incoming data to be sorted. If possible, pull a sorted result set by using an ORDER BY clause at the source instead of using the Sort transformation. There are times, though, when you will need the Sort transformation, for example when pulling unsorted data from flat files.
o As mentioned above, a few components require sorted input. If your incoming data is already sorted, you can set the IsSorted property on the output of the source adapter and specify the sort key columns on which the data is sorted as a hint to these components.
o Try to maintain a small number of large buffers and get as many rows as you can into each buffer by removing unnecessary columns (discussed in Best Practice #2), by tuning the DefaultBufferMaxSize and DefaultBufferMaxRows properties of the data flow task (discussed in Best Practice #7) or by using the appropriate data type for each column (discussed in Best Practice #18).
o If you are on SQL Server 2008, you can utilize some of its features for better performance. For example, you can use the MERGE statement to combine INSERT and UPDATE operations in a single statement while incrementally uploading data (no need for a lookup transformation; a sketch follows this list), and Change Data Capture for incremental data pulls.
o The RunInOptimizedMode property (default FALSE) of a data flow task can be set to TRUE to stop unused columns from flowing down the line to downstream components of the data flow task, which improves its performance. The SSIS project also has a RunInOptimizedMode property, applicable at design time only, which if set to TRUE ensures all data flow tasks are run in optimized mode irrespective of the individual settings at the data flow task level.
o Make use of sequence containers to group logically related tasks into a single unit for better visibility and understanding.
o By default a task, like an Execute SQL task or Data Flow task, opens a connection when it starts and closes it once its execution completes. If you want to reuse the same connection in multiple tasks, set the RetainSameConnection property of the connection manager to TRUE; the connection then stays open so that other tasks can reuse it, and on that single connection you can use a transaction spanning multiple tasks without even requiring the Distributed Transaction Coordinator Windows service. Although you can reuse one connection across different tasks, make sure you are not keeping the connection or transaction open for too long.
o You should understand how the protection level setting works for a package, how it saves sensitive data (encrypted with a user key or password) or does not save it at all, and what impact it has when you move your package from one system to another; refer here for more details.
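To make the MERGE suggestion concrete, here is a sketch of the pattern; the dimension and staging table names and columns are invented for illustration, and the same statement can be placed in an Execute SQL Task or run from ADO.NET as shown.

```csharp
using System.Data.SqlClient;

class IncrementalLoadSample
{
    static void Main()
    {
        // Hypothetical dimension and staging tables, used only to show the shape of the statement.
        string mergeSql = @"
MERGE dbo.DimCustomer AS target
USING staging.Customer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.Name = source.Name,
               target.City = source.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, City)
    VALUES (source.CustomerID, source.Name, source.City);";

        // Placeholder connection string for the data warehouse database.
        using (SqlConnection conn = new SqlConnection("Data Source=.;Initial Catalog=DW;Integrated Security=SSPI"))
        using (SqlCommand cmd = new SqlCommand(mergeSql, conn))
        {
            conn.Open();
            int affected = cmd.ExecuteNonQuery();   // inserts new rows and updates existing ones in one pass
        }
    }
}
```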

Note These recommendations are based on the experience gained from working with DTS and SSIS over the last couple of years. But, as noted before, there are other factors which impact performance, one of them being the infrastructure and network. So you must do thorough testing before putting these changes into your production environment. I am closing this series on SQL Server Integration Services (SSIS) - Best Practices with this Part 4. If you know of any other best practices (I am sure there are several others) which I missed here, please share them in the comments so that others can benefit from our shared experiences.
