Thursday, July 5, 2012

SSIS: Best Practices for Better Performance

This article covers performance improvement techniques and common performance constraints in SSIS, limited to what lies within the developer's scope.

Choice of Transformations
In the real world we usually have to apply several transformations before the data is actually loaded. Transformations use buffer memory, which in turn affects performance, so it is important to understand which transformations influence performance and how.
Transformations can be categorized into fully blocking, semi-blocking and non-blocking transformations.
Fully Blocking Transformations: block the entire dataset to perform the transformation.
Semi-Blocking: block groups of rows to perform the transformation.
Non-Blocking: do not block rows at all.
As a general rule, try to reduce the number of blocking and semi-blocking transformations.


Non-blocking transformations: Audit, Character Map, Conditional Split, Copy Column, Data Conversion, Derived Column, Lookup, Multicast, Percent Sampling, Row Count, Script Component, Export Column, Import Column, Slowly Changing Dimension, OLE DB Command
Semi-blocking transformations: Data Mining Query, Merge, Merge Join, Pivot, Unpivot, Term Lookup, Union All
Blocking transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction

Extra Columns
The most common mistake made when starting to develop an SSIS package is to select all the columns even when some of them are not required. This might not sound like a big deal, but consider a scenario where you need only two fields from a source table that has a hundred-odd fields: the dataset then uses far more buffer space than actually required.
Ensure you always select only those columns which are required. By default, SSIS shows warning messages listing the columns that are not used.
Configuring Lookups
The Lookup is one of the most commonly used transformations in SSIS.
The default lookup query for the Lookup Transform is
SELECT * FROM …
The lookup allows you to choose either a table or a SQL query as the reference; it is always advisable to use a SQL query that selects only the required columns.
Enabling full caching in lookups enhances the performance of the transformation. However, this works reliably only if there are no duplicate records in the reference data. Another common issue is blank spaces in the lookup fields, which cause the lookup to return no matching data; it is better to trim the fields to get matching records in full cache mode.
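A minimal sketch of such a trimmed lookup query, assuming a hypothetical `dbo.DimCustomer` table with `CustomerCode` and `CustomerKey` columns:

```sql
-- Hypothetical lookup query: select only the join key and the surrogate
-- key instead of SELECT * against the whole dimension table.
-- DISTINCT guards against duplicate reference rows, and LTRIM/RTRIM
-- avoids full-cache misses caused by leading/trailing spaces.
SELECT DISTINCT
       LTRIM(RTRIM(CustomerCode)) AS CustomerCode,
       CustomerKey
FROM   dbo.DimCustomer;
```

Remember to apply the same trimming to the input column (for example with a Derived Column transformation) so both sides of the comparison match.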
Use of SCD
The Slowly Changing Dimension transformation is normally used to insert, update or delete records in a table based on the source table data. If both the source and the destination are on the same server (or reachable through linked servers), an alternative approach is to do this in a SQL query.
The MERGE statement, available from SQL Server 2008 onwards, lets you do just that. You could also write a join query to find the matching records which need to be changed.
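A minimal MERGE sketch for a Type 1 (overwrite) load, assuming hypothetical `dbo.StgCustomer` staging and `dbo.DimCustomer` dimension tables joined on `CustomerCode`:

```sql
-- Hypothetical MERGE (SQL Server 2008+) replacing an SCD Type 1 load:
-- update changed rows, insert new rows, delete rows missing from source.
MERGE dbo.DimCustomer AS tgt
USING dbo.StgCustomer AS src
    ON tgt.CustomerCode = src.CustomerCode
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerCode, CustomerName)
    VALUES (src.CustomerCode, src.CustomerName)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

Drop the `WHEN NOT MATCHED BY SOURCE` clause if the source is an incremental feed rather than a full snapshot, otherwise valid dimension rows would be deleted.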
Configuring Source component
Some transformations can be avoided entirely by performing the work in the source component. Joining two tables on the same server, filtering data, sorting data or grouping can all be done with a simple SQL query in the source component.
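A sketch of such a source query, assuming hypothetical `dbo.Orders` and `dbo.Customers` tables; it does in the database what would otherwise need a Merge Join, a Conditional Split and an Aggregate in the data flow:

```sql
-- Hypothetical source query: join, filter, aggregate and sort in the
-- database instead of in the SSIS data flow.
SELECT   c.Region,
         SUM(o.Amount) AS TotalAmount
FROM     dbo.Orders AS o
JOIN     dbo.Customers AS c ON c.CustomerId = o.CustomerId
WHERE    o.OrderDate >= '20120101'
GROUP BY c.Region
ORDER BY c.Region;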
Configuring Destination component
Fast load vs. normal load
The difference is simple: the former is a bulk insert while the latter is a row-by-row insert (use SQL Profiler to see the difference). If you are quite sure about the data being processed and want to considerably reduce the time taken for a huge insert, fast load is the one you need. However, there is a drawback: you cannot redirect specific error rows, because when there is an error the entire batch fails. A workaround is to redirect the failed batch to another destination that does a row-by-row insert into the same table, and redirect the error rows from there.
OLE DB destination vs. SQL Server destination
If the package is executed on the same machine as the database server, using the SQL Server destination instead of the OLE DB destination improves performance considerably.
SSIS Properties
The buffers used by a data flow task can be tuned with the DefaultBufferMaxSize and DefaultBufferMaxRows properties. Increasing them reduces the number of buffers that must pass through the data flow. However, they should not be increased so much that buffers start spilling to disk, which defeats the purpose.
Parallel execution of tasks can be increased with the MaxConcurrentExecutables property. Along with the EngineThreads property, which controls the number of worker threads, you need to figure out the right degree of parallelism for your package.
Configuring Flat File Source
When using a flat file source it is important not to do any unnecessary column conversions. By default all the data is read as strings, so convert only those columns which require conversion to another type (including nvarchar to varchar).
FastParse indicates whether a column uses the quicker, but locale-insensitive, fast parsing routines that Integration Services provides. Setting the Fast Parse option on flat file source columns improves performance by 7 to 20% on large files.
Usage of Indexes
I can't conclude the performance chapter without mentioning the usage of indexes. Indexes can be a huge constraint when inserting high volumes of data into tables with several indexes. A workaround is to drop the indexes before the load and recreate them afterwards.
Also ensure that any unused indexes are removed from the table.
On the other hand, indexes are useful on the source table, so put some thought into creating and deleting them.
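The drop-and-recreate pattern can be sketched as follows, assuming a hypothetical fact table `dbo.FactSales` with a non-clustered index `IX_FactSales_DateKey`:

```sql
-- Hypothetical pattern: drop non-clustered indexes before a bulk load
-- and recreate them afterwards, so the insert does not have to
-- maintain them row by row.
DROP INDEX IX_FactSales_DateKey ON dbo.FactSales;

-- ... run the SSIS data flow / bulk insert here ...

CREATE NONCLUSTERED INDEX IX_FactSales_DateKey
    ON dbo.FactSales (DateKey);
```

In a package this typically means an Execute SQL Task before the data flow and another one after it. Leave the clustered index in place if the incoming data arrives in clustered-key order.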

