SSIS: Dynamically setting the Default Buffer Max Rows property

I've blogged in the past about the Default Buffer Max Rows property in SSIS, and how powerful it can be. As a recap, there are two properties in SSIS that control the amount of data your ETL package will hold in any given buffer: DefaultBufferMaxRows and DefaultBufferSize. When performance tuning SSIS packages, we can modify either one to adjust how much data SSIS carries through a buffer at once. By default, that's 10,000 rows or 10 MB.

One thing I didn't like about my "CalculateDefaultBufferSize" spreadsheet is that it assumed you were going to change your DefaultBufferSize to 100 MB. That's 100 MB per component running in your data flow, which can become extremely costly. If you leave it at the default, however, you're still using 10 MB per component no matter what. If we're importing a really skinny data set (such as a fact table or a hub), it's very possible we're leaving a lot of empty space in the buffer if we stop every 10,000 rows. Conversely, if we have a really wide data set where we can only fit 6,000 or 7,000 rows in a buffer, we're making extra hops to move that same data. It's always a balancing act. I recently decided to see if I could automate setting this property, to maximize the efficiency of my SSIS packages without having to calculate it manually every time.

To start, we need to find our destination table and identify the maximum row length for that table. Fortunately, this can be obtained with a simple query (using DimCurrency in the AdventureWorksDW2012 database):

SELECT
    SUM(max_length) [row_length]
FROM sys.tables t
JOIN sys.columns c
    ON t.object_id = c.object_id
JOIN sys.schemas s
    ON t.schema_id = s.schema_id
WHERE t.name = 'DimCurrency'
    AND s.name = 'dbo'
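
You could easily turn this into a procedure that takes the table name and schema as parameters and returns the row length. A minimal sketch might look like this (the procedure name is just a placeholder I made up):

CREATE PROCEDURE dbo.GetMaxRowLength
    @table_name sysname,
    @schema_name sysname = N'dbo'
AS
BEGIN
    SET NOCOUNT ON;

    SELECT
        SUM(c.max_length) [row_length]
    FROM sys.tables t
    JOIN sys.columns c
        ON t.object_id = c.object_id
    JOIN sys.schemas s
        ON t.schema_id = s.schema_id
    WHERE t.name = @table_name
        AND s.name = @schema_name;
END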

For this post, though, I'm going to leave it as a plain query. Once we have this, it becomes a simple exercise in math. The query above returns a max row length of 118 (in bytes), so we take that value and divide it into our default buffer size (which, fortunately, SSIS stores in bytes as well: 10485760). So our calculation becomes:

10485760/118

giving us a result of 88,862 rows. We'll round down to the nearest 100 (88,800), and that will be our DefaultBufferMaxRows value. Now, how do we make this dynamic? To start, we open up our SSIS package and add an Execute SQL Task to the Control Flow. A good way to improve upon this would be to put the query in a pre-execute SQL task and write the value to an information table; then you can log it and do some reporting around what your buffer settings are. As an added bonus of this dynamic structure: as your data scales, your ETL package will too!
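
If you do go the logging route, the information table doesn't need to be anything fancy. A hypothetical sketch (all of the names here are made up):

CREATE TABLE dbo.BufferMaxRowsLog
(
    log_id INT IDENTITY(1,1) PRIMARY KEY,
    package_name NVARCHAR(260) NOT NULL,
    table_name sysname NOT NULL,
    row_length_bytes INT NOT NULL,
    buffer_max_rows INT NOT NULL,
    logged_at DATETIME2(0) NOT NULL DEFAULT (SYSDATETIME())
);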

Next we want to create 3 variables: str_table_name, str_schema_name, and int_row_size. str_table_name and str_schema_name should be Strings, and int_row_size should be an Int32. For the int_row_size variable, set the value to 1; this avoids an error in our expression later. This way we can parameterize the inputs for our destination table, and create a placeholder for the incoming result. If your package names are the same as your destination tables, then instead of assigning the table name manually you can configure it from an expression using the @[System::PackageName] system variable.
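
For example (assuming a package named DimCurrency that loads dbo.DimCurrency), the expression on str_table_name is simply:

@[System::PackageName]

If your packages carry a prefix instead (Load_DimCurrency is just a hypothetical naming convention), an expression like REPLACE(@[System::PackageName], "Load_", "") strips it back down to the table name.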

Let's open up our Execute SQL Task and configure it. The connection will point to our destination database, and the SQL query will be the one we outlined above, subbing in parameters for the table and schema name.
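
Assuming an OLE DB connection manager (OLE DB marks parameters with question marks; an ADO.NET connection would use named parameters such as @table_name instead), the statement looks like this:

SELECT
    SUM(max_length) [row_length]
FROM sys.tables t
JOIN sys.columns c
    ON t.object_id = c.object_id
JOIN sys.schemas s
    ON t.schema_id = s.schema_id
WHERE t.name = ?
    AND s.name = ?

With an OLE DB connection the parameters map by position, so on the Parameter Mapping tab the parameter names are 0 and 1.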

Do the parameter mapping on the second tab to tie the table and schema parameters to the variables you created above, and set the result set to Single row. The single-row result set should map to your int_row_size variable. Now for the fun part. Create a Data Flow Task, right-click on it, and go to Properties.

Under Expressions, select DefaultBufferMaxRows and click the ellipsis to open the expression editor. For the expression, use the following logic:

@[User::int_row_size] == 1 ? 10000 : 10485760 / @[User::int_row_size]

The initial test of the value ensures that if the SQL query fails to return a proper row size (leaving int_row_size at its initial value of 1), your buffers remain at the default of 10,000 rows. And that's it! Now you can dynamically set DefaultBufferMaxRows for SSIS!
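
One small note: the expression above does straight integer division (88,862 rows for the DimCurrency example) rather than rounding down to the nearest hundred the way we did by hand. If you want that rounding as well, a variant like this should do it (same variables, just a little extra integer math):

@[User::int_row_size] == 1 ? 10000 : (10485760 / @[User::int_row_size]) / 100 * 100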

Now that’s some awesome-sauce.

Chris

SSISDB Reporting: Now What?

Thanks again to everyone who attended my session today on SSISDB Reporting: Now What? At the end of the presentation there were several questions that came up that we didn't have time to answer. Below is the Q&A from the session; I hope this answers everything for everyone. Please feel free to contact me on Twitter (@sqlbischmidt) or via the comments section if you have any additional questions. The code for the SSRS reports and the PowerPivot workbook is here.

  • I see a lot of system tables here. Over a period, these tables could accumulate an enormous amount of log data. Is there a built-in archive/purge job, or do we have to build our own?
    • It's a customizable property. If you right-click on the SSISDB catalog and select Properties, there is a retention period property, which is the number of days SSIS keeps the operation history. (You can also set it with T-SQL; see the sketch after this list.)
  • We are using SSIS 2008 in our company and we have adopted file-based deployment for SSIS packages as opposed to deployment to MSDB. With 2012, I see a lot of features with SSISDB. Do you think we will have a disadvantage to continue with file-based deployment, when my company moves to SSIS 2012?
    • I do, simply because of everything the project deployment model in 2012 gives you. While I didn’t explore it too much for this session, I’m sure you can link the SSISDB and MSDB to get some really powerful reports. Eventually, I think project deployment models will become the norm, but right now it’s a period of transition for everyone. It’s still pretty new.
  • Can I deploy an SSIS project from TFS through PowerShell?
    • In Windows Server 2012 and SQL Server 2012, a lot of the components and classes have been opened up to PowerShell, so I don't believe there would be any reason you couldn't.
  • Is the duration for each control flow item logged in the execution table?
    • Not within the execution table, but there is another one called executable_statistics that contains the information for each component within the control flow (a sample query against it appears after this list). The execution table only contains information about packages as a whole.
  • Since you mentioned source control systems, is there any "Source Control" in 2012?
    • TFS, Subversion, and others have all been adding capabilities to handle 2012.
  • Do you need Admin permissions to query the SSISDB tables?
    • No, you don't necessarily need admin access. Your admin can grant you access to the SSISDB just as they could for any other database.
  • Does BI-xPress SSIS Console Monitor use the reports from SSISDB Catalog or is it based on custom logging?
    • It’s based on something custom.
  • Do all SSIS packages have to run via SQL Agent Job?
    • No, you can run SSIS packages however you prefer: dtexec, through the Integration Services catalog (2012 project deployment model), through the Integration Services instance (2012 package deployment model), or via T-SQL (either the project or package deployment model in 2012).
  • Have they added support for the geography data type?
    • Not that I'm aware of. SSIS still treats geospatial data as a DT_IMAGE (BLOB) when it's moving across the pipeline. For more information, go to this link. (http://bit.ly/UqDFZa)
  • What if I am using a project deployment model but only want to deploy one of the packages in the project?
    • In the current release, this wouldn't be possible. You would have to either keep the project in the package deployment model and deploy packages one at a time, or use the project deployment model and deploy the entire project together.
  • Can you still deploy using manifest file?
    • Yes, as long as you are using the package deployment model (see next question)
  • How has deployment of SSIS packages changed from 2008 to 2012?
    • In terms of the package deployment model, not very much. Everything that was done before is still (technically) supported. The biggest change is the project deployment model, which gives you the ability to treat a collection of SSIS packages as a single unit that you deploy and manage together in the SSISDB catalog.
  • Can we disable data taps?
    • Yes, data taps aren't on by default; you have to specifically add them. In fact, data taps do affect performance and buffering, so make sure you have an actual need before putting them in your package. When calling a package via T-SQL (either in a job step or ad hoc), there is an add_data_tap stored procedure you call against the execution to add the tap before starting the package (see the sketch after this list). No xp_cmdshell! 🙂
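
A few of the answers above mention things you can do against the SSISDB catalog with T-SQL. Here is a rough sketch of what those calls look like; the folder, project, package, and data flow path names (and the execution_id) are placeholders I made up, so adjust them for your environment:

-- Check and change the retention window (the property behind the
-- retention period setting in the catalog Properties dialog)
USE SSISDB;

SELECT property_name, property_value
FROM catalog.catalog_properties
WHERE property_name = N'RETENTION_WINDOW';

EXEC catalog.configure_catalog
    @property_name = N'RETENTION_WINDOW',
    @property_value = 30;

-- Per-component durations from executable_statistics
-- (execution_duration is in milliseconds; 12345 is a made-up execution_id)
SELECT e.package_name,
    e.executable_name,
    s.execution_path,
    s.execution_duration,
    s.execution_result
FROM catalog.executable_statistics s
JOIN catalog.executables e
    ON e.executable_id = s.executable_id
    AND e.execution_id = s.execution_id
WHERE s.execution_id = 12345;

-- Run a package via T-SQL and attach a data tap before starting it
DECLARE @execution_id BIGINT;

EXEC catalog.create_execution
    @folder_name = N'MyFolder',
    @project_name = N'MyProject',
    @package_name = N'LoadDimCurrency.dtsx',
    @execution_id = @execution_id OUTPUT;

EXEC catalog.add_data_tap
    @execution_id = @execution_id,
    @task_package_path = N'\Package\Data Flow Task',
    @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
    @data_filename = N'DimCurrencyTap.csv';

EXEC catalog.start_execution @execution_id;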