SSIS: Password Policy Rules Not Met When Creating Catalog

Last week I was at a client site showing them how to set up the Integration Services Catalog. We were walking through the steps to create it and input the password into the Create Catalog configuration screen. It was a strong password that adhered to all domain requirements. However, when we clicked OK, we received the following error:

TITLE: Microsoft SQL Server Management Studio

——————————

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.Management.IntegrationServices)

——————————

ADDITIONAL INFORMATION:

The transaction ended in the trigger. The batch has been aborted.

========================

   PASSWORD POLICY RULES NOT MET

========================

Login: “##MS_SSISServerCleanupJobLogin##” either has “Enforce password policy” check set to OFF

OR

has “Enforce password expiration” check set to ON.

[Correct Values: “Enforce password policy” = Checked; “Enforce password expiration”= Unchecked ]

Rolling back…

===============================================

(Microsoft SQL Server, Error: 3609)

For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&ProdVer=11.00.3000&EvtSrc=MSSQLServer&EvtID=3609&LinkId=20476

——————————

BUTTONS:

OK

——————————

Ok, let's check our password again. Type a stronger password, try it again. Still no luck. Going out to Bingoogle didn't provide much assistance, apart from some old SQL Server 2000 issues where the database engine threw this when creating a login. So that's not it.

Let's go check the local security policy and the domain policy. Yep, the password meets requirements. I don't see anything that's terribly out of whack. So my next step was to check the event viewer and application logs for an error, but there was nothing there. This server was created by another team off-site, so let's go check the OS locale settings and the collation level for SQL Server to see if some settings are off somewhere. Still nothing! At this point I'm starting to sweat a little. In this environment, we'd created two servers (one for SSIS and the DB Engine and one for SSAS), so let's switch over to the other server and see if it happens there too. Yep, same problem. My next step was to create an inbound rule in the firewall settings, thinking that maybe there was some port blocking happening somewhere. A Twitter outreach on #sqlhelp didn't yield any results either, although I do want to throw a quick thanks out to Robert Davis and Denny Cherry for the assistance!

When that still failed, I figured I'd just try to create the Catalog with PowerShell. Fortunately, the code to do this is readily available on MSDN:

# Load the IntegrationServices Assembly
[Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.Management.IntegrationServices")

# Store the IntegrationServices Assembly namespace to avoid typing it every time
$ISNamespace = "Microsoft.SqlServer.Management.IntegrationServices"

Write-Host "Connecting to server ..."

# Create a connection to the server
$sqlConnectionString = "Data Source=localhost;Initial Catalog=master;Integrated Security=SSPI;"
$sqlConnection = New-Object System.Data.SqlClient.SqlConnection $sqlConnectionString

# Create the Integration Services object
$integrationServices = New-Object $ISNamespace".IntegrationServices" $sqlConnection

# Provision a new SSIS Catalog
$catalog = New-Object $ISNamespace".Catalog" ($integrationServices, "SSISDB", "P@assword1")
$catalog.Create()

I ran it locally on the server, and it failed with the same issue. There has to be something in the internal scripts causing this issue. So my next step was to go back to the basics. We do know that the issue is being caused by this specific login as a part of the process: “##MS_SSISServerCleanupJobLogin##”. Well, CleanupJobLogin sounds like the account that’s used to run the SSIS Server Maintenance Job. If we can get everything to create except for the job, that gives us the Catalog and the main SSISDB. All that login does is serve as the job owner for the “SSIS Server Maintenance Job”, which runs two procedures in the SSISDB: [internal].[cleanup_server_retention_window] and [internal].[cleanup_server_project_version].
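As a quick sanity check, you can see whether that login ever actually got created, and how its policy flags ended up, with a query like this one of mine, using the two flag columns the error message is complaining about:

SELECT name, is_policy_checked, is_expiration_checked
FROM sys.sql_logins
WHERE name = '##MS_SSISServerCleanupJobLogin##';
-- The error text expects is_policy_checked = 1 and is_expiration_checked = 0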

Fire up Profiler, and run a trace. Go back in to “Create Catalog”, type the password, and click Ok. I know it’s going to fail, but that’s not what I’m after here. Go into the trace and capture all of the SQL scripts that the Catalog is running.

select SERVERPROPERTY(N'servername')
DECLARE @edition sysname;
SET @edition = cast(SERVERPROPERTY(N'EDITION') as sysname);
select case when @edition = N'SQL Azure' then 2 else 1 end as 'DatabaseEngineType'

exec sp_executesql N'SELECT
''IntegrationServices[@Name='' + quotename(CAST(SERVERPROPERTY(N''Servername'') AS sysname),'''''''') + '']'' AS [Urn],
CAST(SERVERPROPERTY(N''Servername'') AS sysname) AS [Name]
WHERE
(CAST(SERVERPROPERTY(N''Servername'') AS sysname)=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'PWIRTPTPSP3W12'

SELECT ISNULL(HAS_PERMS_BY_NAME(null, null, 'ALTER SETTINGS'), 0)
SELECT ISNULL(IS_SRVROLEMEMBER ('sysadmin'), 0)
GO
————————————————————–

USE [master]
select SERVERPROPERTY(N'servername')
DECLARE @edition sysname;
SET @edition = cast(SERVERPROPERTY(N'EDITION') as sysname); select case when @edition = N'SQL Azure' then 2 else 1 end as 'DatabaseEngineType'

exec sp_executesql N'SELECT
''IntegrationServices[@Name='' + quotename(CAST(SERVERPROPERTY(N''Servername'') AS sysname),'''''''') + '']'' AS [Urn],
CAST(SERVERPROPERTY(N''Servername'') AS sysname) AS [Name]
WHERE
(CAST(SERVERPROPERTY(N''Servername'') AS sysname)=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'PWIRTPTPSP3W12'
GO

———————————————————————————–

USE [master]
--exec sp_reset_connection
DECLARE @edition sysname;
SET @edition = cast(SERVERPROPERTY(N'EDITION') as sysname); select case when @edition = N'SQL Azure' then 2 else 1 end as 'DatabaseEngineType'
SELECT
cfg.name AS [Name],
cfg.configuration_id AS [Number],
cfg.minimum AS [Minimum],
cfg.maximum AS [Maximum],
cfg.is_dynamic AS [Dynamic],
cfg.is_advanced AS [Advanced],
cfg.value AS [ConfigValue],
cfg.value_in_use AS [RunValue],
cfg.description AS [Description]
FROM
sys.configurations AS cfg

exec sp_executesql N'
        --Preparing to access the Catalog object
        DECLARE @t_catalogs TABLE (
        Name sysname COLLATE SQL_Latin1_General_CP1_CI_AS,
        EncryptionAlgorithm nvarchar(256) COLLATE SQL_Latin1_General_CP1_CI_AS,
        SchemaVersion int,
        SchemaBuild nvarchar(256) COLLATE SQL_Latin1_General_CP1_CI_AS,
        OperationLogRetentionTime int,
        MaxProjectVersions int,
        OperationCleanupEnabled bit,
        VersionCleanupEnabled bit,
        ServerLoggingLevel int,
        OperationLogNumberOfRecords int,
        VersionLogNumberOfRecords int)

        IF DB_ID(''SSISDB'') IS NOT NULL
        BEGIN
        INSERT INTO @t_catalogs VALUES(
        ''SSISDB'',
        (SELECT [property_value] FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''ENCRYPTION_ALGORITHM''),
        (SELECT CAST([property_value] AS INT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''SCHEMA_VERSION''),
        (SELECT [property_value] FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''SCHEMA_BUILD''),
        (SELECT CAST([property_value] AS INT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''RETENTION_WINDOW''),
        (SELECT CAST([property_value] AS INT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''MAX_PROJECT_VERSIONS''),
        (SELECT CAST([property_value] AS BIT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''OPERATION_CLEANUP_ENABLED''),
        (SELECT CAST([property_value] AS BIT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''VERSION_CLEANUP_ENABLED''),
        (SELECT CAST([property_value] AS INT) FROM [SSISDB].[catalog].[catalog_properties] WHERE [property_name]  = N''SERVER_LOGGING_LEVEL''),
        (SELECT COUNT(operation_id) FROM [SSISDB].[catalog].[operations]),
        (SELECT COUNT(object_id) FROM [SSISDB].[catalog].[object_versions])
        )
        END

SELECT
''IntegrationServices[@Name='' + quotename(CAST(SERVERPROPERTY(N''Servername'') AS sysname),'''''''') + '']'' + ''/Catalog[@Name='' + '''''''' + REPLACE((SELECT Name from @t_catalogs), '''''''', '''''''''''') + '''''''' + '']'' AS [Urn],
(SELECT Name from @t_catalogs) AS [Name],
(SELECT EncryptionAlgorithm from @t_catalogs) AS [EncryptionAlgorithm],
(SELECT SchemaVersion from @t_catalogs) AS [SchemaVersion],
(SELECT SchemaBuild from @t_catalogs) AS [SchemaBuild],
(SELECT OperationLogRetentionTime from @t_catalogs) AS [OperationLogRetentionTime],
(SELECT MaxProjectVersions from @t_catalogs) AS [MaxProjectVersions],
(SELECT OperationCleanupEnabled from @t_catalogs) AS [OperationCleanupEnabled],
(SELECT VersionCleanupEnabled from @t_catalogs) AS [VersionCleanupEnabled],
(SELECT ServerLoggingLevel from @t_catalogs) AS [ServerLoggingLevel],
(SELECT OperationLogNumberOfRecords from @t_catalogs) AS [OperationLogNumberOfRecords],
(SELECT VersionLogNumberOfRecords from @t_catalogs) AS [VersionLogNumberOfRecords]
WHERE
(CAST(SERVERPROPERTY(N''Servername'') AS sysname)=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'PWIRTPTPSP3W12'

SELECT ISNULL(IS_SRVROLEMEMBER ('sysadmin'), 0)

exec sp_executesql N'SELECT name FROM msdb.sys.sysdatabases WHERE name = @dbname',N'@dbname nvarchar(6)',@dbname=N'SSISDB'

declare @key_value nvarchar(1024); exec master.dbo.xp_regread 'HKEY_LOCAL_MACHINE','SOFTWARE\Microsoft\Microsoft SQL Server\110\SSIS\Setup\DTSPath', N'', @key_value output;select @key_value;

DECLARE @CatalogFileExists bit
if object_id('tempdb..#t') is not null drop table #t
BEGIN
       DECLARE @CatalogFile nvarchar(1024)
       SELECT @CatalogFile = N'C:\Program Files\Microsoft SQL Server\110\DTS\Binn\SSISDBBackup.bak'
       CREATE TABLE #t (file_exists int, is_directory int, parent_directory_exists int)
       INSERT #t EXEC xp_fileexist @CatalogFile
       SELECT TOP 1 @CatalogFileExists=file_exists from #t
END
SELECT @CatalogFileExists
-- check whether the database is already exist and with correct version
SET NOCOUNT ON

IF DB_ID('SSISDB') IS NOT NULL
    RAISERROR(27135, 16, 1, 'SSISDB')

IF SUBSTRING (CAST(SERVERPROPERTY('ProductVersion') AS VARCHAR(20)),1,2) <> '11'
    RAISERROR (27193,16,1, 'Denali') WITH NOWAIT

SELECT [value_in_use] FROM sys.configurations WHERE [name] = 'clr enabled'
DECLARE @path nvarchar(1024) = Convert(nvarchar(1024),ServerProperty('MasterFile')); SELECT @path = SUBSTRING(@path, 1, CHARINDEX(N'master.mdf', LOWER(@path)) - 1); SELECT @path;
--DECLARE @CatalogFileExists bit
if object_id('tempdb..#t') is not null drop table #t
BEGIN
       --DECLARE @CatalogFile nvarchar(1024)
       SELECT @CatalogFile = N'D:\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SSISDB.mdf'
       CREATE TABLE #t (file_exists int, is_directory int, parent_directory_exists int)
       INSERT #t EXEC xp_fileexist @CatalogFile
       SELECT TOP 1 @CatalogFileExists=file_exists from #t
END
SELECT @CatalogFileExists
--DECLARE @CatalogFileExists bit
if object_id('tempdb..#t') is not null drop table #t
BEGIN
       --DECLARE @CatalogFile nvarchar(1024)
       SELECT @CatalogFile = N'D:\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SSISDB.ldf'
       CREATE TABLE #t (file_exists int, is_directory int, parent_directory_exists int)
       INSERT #t EXEC xp_fileexist @CatalogFile
       SELECT TOP 1 @CatalogFileExists=file_exists from #t
       DROP TABLE #t
END
SELECT @CatalogFileExists
exec sp_executesql N'RESTORE FILELISTONLY FROM DISK = @backupfile',N'@backupfile nvarchar(67)',@backupfile=N'C:\Program Files\Microsoft SQL Server\110\DTS\Binn\SSISDBBackup.bak'
exec sp_executesql N'RESTORE DATABASE @databaseName FROM DISK = @backupFile  WITH REPLACE, MOVE @dataName TO @dataFilePath, MOVE @logName TO @logFilePath',N'@databaseName nvarchar(6),@dataName nvarchar(4),@dataFilePath nvarchar(65),@logName nvarchar(3),@logFilePath nvarchar(65),@backupFile nvarchar(67)',@databaseName=N'SSISDB',@dataName=N'data',@dataFilePath=N'D:\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SSISDB.mdf',@logName=N'log',@logFilePath=N'D:\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\SSISDB.ldf',@backupFile=N'C:\Program Files\Microsoft SQL Server\110\DTS\Binn\SSISDBBackup.bak'
GO
————————————————————————-

USE master; 

IF EXISTS (SELECT [name] FROM sys.databases WHERE [name]='SSISDB' AND [is_read_only] = 1)  ALTER DATABASE [SSISDB] SET READ_WRITE WITH ROLLBACK IMMEDIATE

USE [SSISDB];

IF EXISTS (SELECT [name] FROM sys.symmetric_keys WHERE [name] = '##MS_DatabaseMasterKey##')
    DROP MASTER KEY

exec sp_executesql N'USE [SSISDB];DECLARE @pwd nvarchar(4000) = REPLACE(@password, N'''''''', N'''''''''''');EXEC(''CREATE MASTER KEY ENCRYPTION BY PASSWORD = '''''' + @pwd + '''''''');',N'@password nvarchar(10)',@password=N'Password@1'
GO
——————————————————————————-

USE master

   IF NOT EXISTS(SELECT * FROM sys.asymmetric_keys WHERE name = 'MS_SQLEnableSystemAssemblyLoadingKey')
        CREATE ASYMMETRIC KEY MS_SQLEnableSystemAssemblyLoadingKey FROM EXECUTABLE FILE = 'C:\Program Files\Microsoft SQL Server\110\DTS\Binn\Microsoft.SqlServer.IntegrationServices.Server.dll'

   IF EXISTS(SELECT [name] FROM sys.server_principals where name = '##MS_SQLEnableSystemAssemblyLoadingUser##')
        DROP LOGIN ##MS_SQLEnableSystemAssemblyLoadingUser##

  CREATE LOGIN ##MS_SQLEnableSystemAssemblyLoadingUser## FROM ASYMMETRIC KEY MS_SQLEnableSystemAssemblyLoadingKey
    GRANT UNSAFE ASSEMBLY TO ##MS_SQLEnableSystemAssemblyLoadingUser##

  IF EXISTS(SELECT name FROM sys.procedures WHERE name=N'sp_ssis_startup')
    BEGIN
       EXEC sp_procoption N'sp_ssis_startup','startup','off'
       DROP PROCEDURE [sp_ssis_startup]
    END
GO

   CREATE PROCEDURE [dbo].[sp_ssis_startup]
    AS
    SET NOCOUNT ON
        /* Currently, the IS Store name is 'SSISDB' */
        IF DB_ID('SSISDB') IS NULL
            RETURN
        IF NOT EXISTS(SELECT name FROM [SSISDB].sys.procedures WHERE name=N'startup')
            RETURN
        /*Invoke the procedure in SSISDB  */
        EXEC [SSISDB].[catalog].[startup]

   IF (0=1)
    BEGIN
        /* Run sp_ssis_startup when Sql Server restarts */
        EXEC sp_procoption N'sp_ssis_startup','startup','on'
    END
GO
——————————————————————————–

--USE msdb

--IF EXISTS (SELECT name FROM sysjobs WHERE name = N'SSIS Server Maintenance Job')
--EXEC sp_delete_job
--    @job_name = N'SSIS Server Maintenance Job' ;

--IF EXISTS(SELECT * FROM sys.server_principals where name = '##MS_SSISServerCleanupJobLogin##')
--    DROP LOGIN ##MS_SSISServerCleanupJobLogin##

--DECLARE @loginPassword nvarchar(256)
--set @loginPassword = 'Password@1'

--SELECT @loginPassword = REPLACE (CONVERT( nvarchar(256), CRYPT_GEN_RANDOM( 64 )), N'''', N'''''')
--EXEC ('CREATE LOGIN ##MS_SSISServerCleanupJobLogin## WITH PASSWORD =''' +@loginPassword + ''', CHECK_POLICY = ON')

 

There, at the end of the script, is the cause of the consternation. The script can run everything successfully except the creation of that login. I commented out those lines of code and reran the script. Success!!! AWESOME-SAUCE! One catalog ready to do some cool stuff. Note that you could also use this approach if you just want to create the catalog with T-SQL. The last step is to create the job manually to call those procedures and keep SSISDB from growing too large.
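If you do eventually want the cleanup login that the commented-out block would have created, one option (my own workaround sketch, not part of the generated script) is to create it yourself with the flags the trigger expects, substituting your own strong password:

-- 'Correct Values' from the error dialog: policy checked, expiration unchecked
CREATE LOGIN [##MS_SSISServerCleanupJobLogin##]
WITH PASSWORD = 'SomeStrongPasswordHere!42',
     CHECK_POLICY = ON,
     CHECK_EXPIRATION = OFF;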

SIDE NOTE: the main password for the Catalog is the @password value passed to the CREATE MASTER KEY statement in the script above (Password@1 in this capture). Make sure you change it if you are using this script to create your catalog.

USE [msdb]
GO
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

END

DECLARE @jobId BINARY(16)
EXEC @ReturnCode =  msdb.dbo.sp_add_job @job_name=N'SSIS Server Maintenance Job',
        @enabled=1,
        @notify_level_eventlog=2,
        @notify_level_email=0,
        @notify_level_netsend=0,
        @notify_level_page=0,
        @delete_level=0,
        @description=N'Runs every day. The job removes operation records from the database that are outside the retention window and maintains a maximum number of versions per project.',
        @category_name=N'[Uncategorized (Local)]',
        @owner_login_name=N'DOMAIN\USERNAME', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object:  Step [SSIS Server Operation Records Maintenance]    Script Date: 2/27/2015 8:17:49 AM ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'SSIS Server Operation Records Maintenance',
        @step_id=1,
        @cmdexec_success_code=0,
        @on_success_action=3,
        @on_success_step_id=0,
        @on_fail_action=2,
        @on_fail_step_id=0,
        @retry_attempts=3,
        @retry_interval=3,
        @os_run_priority=0, @subsystem=N'TSQL',
        @command=N'EXEC [internal].[cleanup_server_retention_window]',
        @database_name=N'SSISDB',
        @flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'SSIS Server Max Version Per Project Maintenance',
        @step_id=2,
        @cmdexec_success_code=0,
        @on_success_action=1,
        @on_success_step_id=0,
        @on_fail_action=2,
        @on_fail_step_id=0,
        @retry_attempts=3,
        @retry_interval=3,
        @os_run_priority=0, @subsystem=N'TSQL',
        @command=N'EXEC [internal].[cleanup_server_project_version]',
        @database_name=N'SSISDB',
        @flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'SSISDB Scheduler',
        @enabled=1,
        @freq_type=4,
        @freq_interval=1,
        @freq_subday_type=1,
        @freq_subday_interval=0,
        @freq_relative_interval=0,
        @freq_recurrence_factor=0,
        @active_start_date=20001231,
        @active_end_date=99991231,
        @active_start_time=0,
        @active_end_time=120000,
        @schedule_uid=N'82cb4d40-977f-480b-9c59-bf7509e16740'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:

GO

And there you have it. Creating the SSIS Catalog via T-SQL.

 

~Chris

SSIS: Dynamically setting the Default Buffer Max Rows property

I've blogged in the past about the Default Buffer Max Rows property in SSIS, and how powerful it can be. As a recap, there are two properties in SSIS that control the amount of data that your ETL package will hold in any given buffer: DefaultBufferMaxRows and DefaultBufferSize. When performance tuning SSIS packages, we can modify either one to adjust how much data SSIS can carry through a buffer at once. By default, this is 10,000 rows or 10 MB. One thing I didn't like about my "CalculateDefaultBufferSize" spreadsheet is that it assumed you were going to change your DefaultBufferSize to 100 MB. That's 100 MB per component running in your data flow, which can become extremely costly. If you leave it at the default, however, you're still using 10 MB per component no matter what. If we have a really skinny data set we're importing (such as a fact table or a hub), it's very possible we're leaving a lot of empty space in the buffer if we stop every 10,000 rows. Conversely, if we have a really wide data set where we can only fit 6,000 or 7,000 rows in a buffer, we're making extra hops to move the same data. It's always a balancing act. I recently decided to see if I could automate setting this property in order to maximize the efficiency of my SSIS packages without needing to calculate it manually every time.

To start, we need to find our destination table and identify the maximum row length in that table. Fortunately, this can be obtained with a simple query (using DimCurrency in the AdventureWorksDW2012):

SELECT
    SUM (max_length)  [row_length]
FROM  sys.tables t
JOIN sys.columns c
    ON t.object_id=c.object_id
JOIN sys.schemas s
    ON t.schema_id=s.schema_id
WHERE t.name = 'DimCurrency'
    AND s.name = 'dbo'
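As an aside, a rough sketch of how you might wrap that query in a reusable procedure (the name usp_GetMaxRowLength and its parameters are my own invention, not from any product):

CREATE PROCEDURE dbo.usp_GetMaxRowLength
    @TableName  sysname,
    @SchemaName sysname = N'dbo'
AS
BEGIN
    SET NOCOUNT ON;
    -- Same query as above, parameterized; returns a single-row result set
    -- that an Execute SQL task can map straight into a variable
    SELECT SUM(c.max_length) AS [row_length]
    FROM sys.tables t
    JOIN sys.columns c ON t.object_id = c.object_id
    JOIN sys.schemas s ON t.schema_id = s.schema_id
    WHERE t.name = @TableName
      AND s.name = @SchemaName;
END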

You could easily turn this into a procedure that takes the table name and schema as parameters and returns the row length (roughly what the sketch above does). For this post, I'm going to leave it as a query. Once we have this, it becomes a simple exercise in math. The query above returns a max row length of 118 bytes, so we take our default buffer size (which fortunately is also stored by SSIS in bytes, as 10485760) and divide it by the row length. So our calculation becomes:

10485760/118

giving us a result of 88,862 rows. We'll round down to the nearest 100, and that will be our DefaultBufferMaxRows value. Now how do we make this dynamic? To start, we open up our SSIS package and add an Execute SQL task to the Control Flow. A good way to improve on this would be to put the logic in a pre-execute SQL task and write the value to an information table; then you can log it and do some reporting on what your buffers look like. As an added bonus of this dynamic structure: as your data scales, your ETL package will too!

image

Next we want to create 3 variables: str_table_name, str_schema_name, and int_row_size. str_table_name and str_schema_name should be strings, and int_row_size should be an Int32. For the int_row_size variable, set the Value to 1. This is going to avoid an error in our expression later. This way we can parameterize our inputs for our destination tables and the corresponding values, and create a placeholder for our incoming result.  If your package names are the same as the destination tables then instead of having to assign this value manually you could just configure it from an expression, using the @[System::PackageName] global variable:

image

Let’s open up our Execute SQL Task and configure it. The connection will point to our destination database, and the SQL query will be as we’ve outlined above, subbing parameters for the table and schema name:

image
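For reference, the parameterized version of the query looks roughly like this (the 0/1 parameter ordering is my assumption; map them to str_table_name and str_schema_name on the next tab):

SELECT
    SUM(max_length) AS [row_length]
FROM sys.tables t
JOIN sys.columns c
    ON t.object_id = c.object_id
JOIN sys.schemas s
    ON t.schema_id = s.schema_id
WHERE t.name = ?    -- parameter 0: table name
    AND s.name = ?  -- parameter 1: schema name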

Do the parameter mapping on the 2nd tab to parameterize your table and schema with the variables you created above, and set your result set to single row. The single row result set should map to your int_row_size variable. Now for the fun part. Create a data flow task, and right click on it and go to properties.

image

Under expressions, select DefaultBufferMaxRows and expand the expression ellipses. For the expression, use the following logic:

@[User::int_row_size] == 1 ? 10000 : 10485760 / @[User::int_row_size]

The initial test of the value is to ensure that if the SQL query fails to get the proper row size, your buffers remain at the default. And that's it! Now you can dynamically set your DefaultBufferMaxRows property in SSIS!

Now that’s some awesome-sauce.

Chris

SSIS: Experimenting with the target cache lookup

As a part of the project I'm currently working on, the development team is very enamored with the cache transform in SSIS. I consider myself pretty experienced with SSIS, and I've always stayed away from components that come anywhere close to touching disk, as I believe they slow down package performance. The cache transform takes a query that you specify (whatever you're looking up) and saves the results into a .caw file for SSIS to use in other locations. It's also an alternative to using an in-memory recordset destination to hold the result of a transform. As a use case, let's say you have a customer dimension that every single fact table you load wants to reference; in the Kimball dimensional world, this would be one of your conformed dimensions. Do you really want to do the lookup against that customer table in every package, or would you rather save the customer records once and have SSIS retrieve them directly from the .caw file? After much debate in the office, being the ever curious type I am, I decided to test the performance of this cache transform against doing the lookup in every package. Is there a point of diminishing returns?

Looking at the lookup component, we have two choices on the first page when we open the editor:

a) OLE DB connection manager

b) cache connection manager

Most of the time, we just leave it on the OLE DB connection manager, go to the next page, and input our command. However, we can also connect with the cache connection manager. This tells SSIS to go find a cache file (what SSIS internally calls a .caw file) and use that instead of re-caching the records in every package that needs them. For my example, I've created 3 tables: a customer dimension table with 1 million customers, a sales transaction table with 10 million sales in it, and an activity table tracking customer login activity to my website with another 10 million records.

The customer dimension is loaded first, and the fact tables are then generated. Upon final completion, I’ve created a master package to create parallelism:

image

Both of my fact table packages are loaded with an OLE DB connection manager, my control flow is just a straightforward data flow task, and the data flow looks like this:

image

 

This is what we'd consider the "classic" approach to fact table loading. We're pulling in the data from the source, doing the lookup to the customer dimension, and then replacing any null values farther downstream. Looking at the performance measures, the package runs these 10 million rows in 1 minute 32 seconds, which is pretty good. I'm particularly interested in the lookup component: it cached 1,048,540 rows and processed them in 0.391 seconds, using 37,747,440 bytes of memory (35.99 MB). That's not bad. One thing to note here is that we are only pulling in the lookup columns we need, nothing extra. This is very important for space, especially when we consider the memory SSIS is trying to use here.

Ok, so that’s our classic approach. Now let’s try using this cache transform.

On the control flow for the customer package, I'm going to add another data flow. In it, I'll take a source component and fill it with the query the fact table loads were using for their lookup; in this case that's just SELECT a,b FROM Customer. Directly underneath it I'm going to add a cache transform and configure a new cache connection by going to New… and checking the Use file cache option. We'll give the file a name, and click on Columns to do the mapping. We also want to add an index on the data, so we'll set the index position for customer_id to 1, as that's what we're going to be joining on later. Once this is complete my control flow will look like the below:

image

And my data flow for the Build cache is just a simple source and cache destination:

image

Alright, so now we go back to our data flows for sales and activity, and change the “LKP Customers” lookup component to a cache connection type instead of the OLE DB connection type. If you miss it, it’s on the first page under General –> Connection Type. On the connection page, select the location where you stored the customer cache. Click on columns, and make sure we’re mapping from CustomerNumber to customer_id. This should return our customer key.

So our new package doesn't really look any different at the data flow level. But internally, we've told SSIS: "don't go cache those customers every time we need to do this lookup; I've put them in a cache file for you, on disk at this location." Each package that runs the lookup still caches the data in memory; it just loads it from the cache file you've built instead of pointing at the database.

Running each package individually, and then a final run of the master package results in the following table of results:

ALL TIMES IN SECONDS    Customer    Sales    Activity    Master Package
With Cache                   6.7     94.3        87.5             137.2
Without Cache                3.4     92.8        85.3             141.4

It's an interesting observation to me that overall, using the cache transform doesn't appear to give us any performance gain. When running packages individually, it seems to add a few seconds to each run in order to retrieve the results from disk. When running packages in parallel, it appears that you gain a few seconds of run time in the actual lookup packages, but may lose that time again in building the cache. This leads me to the conclusion that it's better to stick with the typical design pattern and point the lookup at your actual table. The reason I believe people would be tempted to use this in the first place is to prevent the occasional blocking while lookup queries are running. This is the third approach I've seen to solving that issue (the first being the NOLOCK query hint and the second being snapshot isolation). Use NOLOCK at your own risk. I think my personal preference after this is going to be continuing to use snapshot isolation for my lookup queries. It just doesn't seem that you're really gaining anything with the additional development you need to do. An additional risk with building cache files is that you now have potentially sensitive data stored in a cache file on disk, which opens up an entirely new area that you have to secure, protect, etc.
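For reference, a minimal sketch of the snapshot-based alternative I'm referring to, assuming you control the source database (the database and table names here are placeholders):

-- Option 1: read-committed snapshot, so ordinary SELECTs (including lookup
-- queries) read row versions instead of taking shared locks
ALTER DATABASE MyDW SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Option 2: allow snapshot isolation and opt in per session
ALTER DATABASE MyDW SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Either way, the lookup query itself stays a plain SELECT
SELECT CustomerKey, CustomerNumber
FROM dbo.DimCustomer;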

 

So my results after everything are essentially inconclusive. I'm interested in hearing from you, the readers: what additional real-world use cases can you think of for this component?

SSIS: Understanding BufferTempStoragePath

Recently while teaching the SSIS Masters class for Pragmatic Works, one of my students asked about the BufferTempStoragePath property in the data flow and how it could be used to assist in performance tuning. Since there don't seem to be many articles on this property, I thought a good article on it and what it does might be helpful for some others out there who haven't used it before.

The BufferTempStoragePath property tells SSIS where it should store buffers on disk if a memory allocation fails. That statement tells us right away that, in theory, this is a property that should never need to be used. One of the things that makes SSIS so quick is its ability to cache data in memory and move it from place to place. This property allocates locations on disk that can store those buffers temporarily if SSIS runs out of memory. We can watch SSIS memory usage by monitoring the Buffers in Use and Buffers Spooled counters in perfmon, which are available under the SQL Server: SSIS Pipeline XX.X (your version number here) group.

The Buffers in Use counter is the number of buffers that SSIS is currently using. The Buffers Spooled counter is how many buffers SSIS couldn't keep in memory and spooled to disk. How does this tie into BufferTempStoragePath? You can think of it this way: whenever Buffers Spooled is greater than 0, BufferTempStoragePath is where SSIS puts them. If you do run into a scenario where you may encounter a memory issue (i.e. your application has millions of rows to move and the box you're working with has 8 GB of RAM), specifying this property can help you, a little. If you do need to specify it, make sure you set it to a drive with REALLY REALLY fast storage (the fastest you can get in the enterprise). Solid state is ideal, but if that's not possible, get the fastest you can.

There is an excellent article that Arshad Ali wrote on SSIS buffers here.

SSIS: Package has a different protection level than the project

Today I was playing around with BIML a bit and I got an interesting error message when I went to execute my package during testing. The error that popped up was:

Error 1: Project consistency check failed. The following inconsistencies were detected: Test BIML Package.dtsx has a different ProtectionLevel than the project.

I checked the properties of the package I generated by clicking on the control flow, going to properties, and looking under ProtectionLevel:

image

Then I right clicked on the project level and went to properties to check the protection level of the project:

image

That checked out ok, so I thought I should check the other packages as well. I opened up another package in the project and checked the protection level:

image

Bingo. I changed the protection level to "EncryptSensitiveWithUserKey", saved it, and reran my original package. Interestingly, when I was running my package I had right-clicked on it and clicked Execute, rather than clicking the play debugger at the top. In 2012, since all packages are really just containers under a parent project group, the consistency check must apply to all packages in the project. I'll have to remember that for the future.

SSIS: Removing Duplicates in a data flow

Recently I came across a post online that was providing a solution to duplicates in a data flow. As I was reading it, there was something in it that makes me cringe: a Sort transform. Those of you who know me know that the Sort transform in a data flow gives me nightmares, almost as much as the Slowly Changing Dimension wizard on medium and large dimensions. As I was reading, I kept thinking there has to be a better way than sorting the data inside of the data flow with that component. You could write a script to do it, but my personal opinion is to do as much in the data flow and control flow as possible without scripting, as I feel script components make SSIS much harder to maintain. I came up with an alternate solution, and I would like to use the rest of this blog post to present my findings.

For my sample tests here, I am going to use a sample file that I’ve created on customers:

image

I’m also going to make an assumption that we don’t have DQS available in this environment. If you want more information on how DQS can help with this problem, see my earlier posts here.

The first option we have is to use the Fuzzy Grouping component. Let's start by creating a simple SSIS package to pick up from the customer csv file. I've created a package, the source csv connection manager, and a flat file source in my data flow. We can tell by looking at the file that Bob and Robert are 99% duplicates of each other. SSN could work just as well here, or anything else, as long as we know it really tells us who a customer is and what makes them distinct (or product, or account, or facility, etc.). Moving on! Let's drag over and configure a Fuzzy Grouping component from the "Other Transforms" section of SSIS. Connect the path from the source, and open it up.

image

Configure the OLE DB connection manager, and then go to the Columns tab. We want to tell this component about the same keys that we identified in the last step: what can the component use to resolve the duplicates and come up with the distinct (in this case) customers? In our sample set, we have telephone number. Check the box next to telephone number, and a table will appear. Under the minimum similarity column, input a value to give the component a rule for how similar something must be to count as a match. Depending on your data point, this can be any value. For phone number, we're going to say that if it's a 99% match, it's probably the same customer or household. This value will change, though; it depends on your environment and your specific data point. Groupings of products, for example, may justify lowering this number to something in the 75%-80% range. My completed column looks like the below:

image 
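As a quick sanity check outside of SSIS, you can also see which key values show up more than once with a simple query; unlike fuzzy grouping this only catches exact matches on the key, and the table and column names below are just my assumed staging of the sample file:

SELECT Telephone, COUNT(*) AS occurrences
FROM dbo.StagingCustomer
GROUP BY Telephone
HAVING COUNT(*) > 1;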

Let's look at what we have so far. Drag over a multicast, and enable the data viewer on it so we can see the rows pass through the pipeline. Execute the package, and our data viewer appears:

image

I included the data flow over to the left in this screenshot, so you can also see what the package looks like. We can see the key_in and key_out fields have been added. Let’s take a conditional split component, and configure it with 2 outputs, one for the matching records and one for the other, duplicate records:

image

For the wrong data, we don't want to just delete it out of the data flow. We want to keep track of it somewhere so we can see what's being removed. Let's drag over a destination component for our good records and one for our bad records, and configure them. Our finished data flow looks like the below:

image
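In case it helps, here is one possible shape for those two destination tables; the names and columns are my own guesses based on the sample file, and the last three columns only exist if you pass the fuzzy grouping outputs through to the duplicates table:

-- Clean, de-duplicated customers
CREATE TABLE dbo.Customer_Clean (
    CustomerName varchar(100),
    Telephone    varchar(20)
);

-- Rows removed as duplicates, kept for auditing
CREATE TABLE dbo.Customer_Duplicates (
    CustomerName varchar(100),
    Telephone    varchar(20),
    _key_in      int,   -- fuzzy grouping's input row key
    _key_out     int,   -- key of the row this record was grouped under
    _score       real   -- overall similarity score
);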

Run the package, and let’s check our tables:

image

image

image

We can see that the bad records moved into the appropriate table for auditing, and our new clean records are now available for us to use!

SSIS – Massive Scaling & Parallelism

If you have a very large set of data or lots of files that you want to move every night, say because of thousands of flat files or terabytes and terabytes of data, having SSIS on one server just churning away most likely isn't going to give you the performance you need. The solution is to partition that workload out across multiple servers or packages so that all of the data can be processed within the given SLA. The other day I received an email from a colleague about moving around 7,000 flat files nightly to archive them. In the sensor data world, this isn't an uncommon scenario. The current process was taking around 80 minutes, or archiving about 3 files every 2 seconds. Simple math breaks this down for us as well:

image
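In rough numbers (my arithmetic, restating what the screenshot shows): 7,000 files / 80 minutes ≈ 88 files per minute ≈ 1.5 files per second, which works out to about 3 files every 2 seconds.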

The question was, how can we speed this up? Let's come up with an SLA of, say, 10 minutes. So the challenge is that we want to move 7,000 files inside of 10 minutes, to keep our SLA acceptable to the client. To achieve this, we can use a workpile pattern. The workpile pattern has SSIS read from a designated "controller table" and check files in and out as it works through the list. To demonstrate, let's use 4 flat files that are identical in structure:

image

We can see from the above image that I have a csv file that consists of a code column and a description column. On the right are the 4 files that I wish to process.

The first thing we need to do is create the data flow portion of the package. A classic approach to this would be a straight forward for each loop container with our data flow inside: (please note this blog post does not discuss initial configuration of the foreach loop container)

image

Our main challenge here is that while this works great for our small sample set, real life production may contain hundreds or thousands of files. Our SLA in this scenario is that we want to move 7,000 files, of anywhere from 1-1000 MB each, into our database nightly. By adding a controller table, we can control the file paths and priorities of each of the files:

CREATE TABLE [dbo].[SSIS_WorkPile](
    [WorkPileKey] [int] IDENTITY(1,1) NOT NULL,
    [Filename] [varchar](100) NULL,
    [IsStarted] [tinyint] NULL,
    [IsCompleted] [tinyint] NULL,
    [Priority] [int] NULL,
    [StartTime] [datetime] NULL,
    [EndTime] [datetime] NULL,
    [LastRunNumberOfRecords] [int] NULL
)

After that, let’s insert our file names into the table:

INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRLINE_ID',0,0,1,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT',0,0,2,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT_ID',0,0,3,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT_SEQ_ID',0,0,4,NULL,NULL,NULL)
GO

Notice that I'm just inserting the file name itself, without the full path or the extension on the end. In this example, I'm putting the file path in a variable and hard-coding the csv extension on the end. If your scenario is different, you could modify this slightly and add another column for the file extension or file path, as in the sketch just below. I've also set the Priority starting at 1 for the first file, and IsStarted and IsCompleted to 0. I've left the rest of the fields null.
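A possible variation, if you would rather store those pieces per file instead of in the package (the column names here are mine):

-- Optional columns for scenarios where the path or extension varies per file
ALTER TABLE dbo.SSIS_WorkPile
    ADD FilePath      varchar(260) NULL,
        FileExtension varchar(10)  NULL;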

Now back to our package. First, let's take out our foreach loop container and replace it with a for loop container. Why is this important? If we think about the components for a second, the foreach loop container is designed to loop over a specific set of objects in a location that we specify. For loop containers can loop over, well, anything. If we change our mindset and think about the fact that we're no longer looping over objects, just over a list of rows in a SQL table, it makes sense. With that being said, let's add a for loop container to our package and move our objects out of the foreach loop container and into it:

image

We have 3 components inside of our for loop: find the file, process the file, and mark it done. Apart from the for loop container, this isn’t that much different than the process we might do for an incremental load design (find last load time, do some work, mark new load time :)) . Let’s configure our variables and the for loop container first. We have 7 variables here:

image 

Configure the filepath to be your filepath only, with the last backslash. Don’t worry about configuring the rest of them right now, it’s not important yet. Create an Execute SQL component before the for loop container and call it “SQL – Get Number of Files”. We are going to use this to identify how many total files need to be moved so the for loop container knows when to stop and doesn’t run forever. Open up the component, configure your connection, and change the ResultSet to “Single Row”. Under the SQLStatement, we are going to select the total number of rows in our command table:

SELECT COUNT(*) FROM SSIS_WorkPile.DBO.SSIS_WorkPile

Move over to the result set tab, and configure the result set mapping. Change the ResultName to 0, and select User::intNumberOfFilesToWork as the variable we want to map our result to. A completed screenshot of our configuration is below:

image image

Click Ok. Now open the for loop container, and configure the Init and Eval properties to evaluate off of the intTaskNumber variable:

image 

Click Ok, and move over to SQL Server Management Studio. We need to create a SQL statement that does 2 things: first, get the next unworked file, and second, mark that we've started it in order to prevent any of our other worker-bee packages from picking it up. SSIS has *ahem* difficulty accepting multiple columns back from a SQL statement in the Execute SQL task, but wrapping it in a proc allows us to generate multiple outputs. So we wrap it in a stored procedure with two output parameters: one for the key of the next file we want to work, and one for the actual file name. Our stored procedure script is:

CREATE PROCEDURE [dbo].[usp_GetTask]
@intTaskNumber INT = 0 OUTPUT
, @strFileName NVARCHAR(255) = N'' OUTPUT
AS
BEGIN

      UPDATE TOP (1) SSIS_WorkPile
      SET @intTaskNumber = WorkPileKey
            , @strFileName = [FileName]
            , StartTime = GETDATE()
            , IsStarted = 1
      WHERE IsStarted = 0
            AND Priority = (SELECT MIN(Priority) FROM SSIS_WorkPile WHERE IsStarted = 0)

      IF @intTaskNumber IS NULL BEGIN
            SET @intTaskNumber = 1
            SET @strFileName = ''
      END
END

GO
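Before wiring it into SSIS, you can sanity-check the procedure from Management Studio with something like this (a quick test of my own):

-- Grab the next unworked file and show what came back
DECLARE @task int, @file nvarchar(255);
EXEC dbo.usp_GetTask @task OUTPUT, @file OUTPUT;
SELECT @task AS NextTaskKey, @file AS NextFileName;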

Go back to SSIS, and open up the first Execute SQL task. Set the result set to single row, configure your connection manager, and under the SQLStatement let’s call our stored procedure:

usp_GetTask ? OUT, ? OUT

Map your parameter return values:

image

 

Drop in a data flow, and connect the SQL component that you configured in the last step to it. We need to do 2 things here; the first is to prevent the data flow from kicking off if we get a blank filename back from the table. Right click on the precedence constraint, and select Edit. Change the EvaluationOperation to "Expression and Constraint", and in the expression enter:

@strFileName != ""

Click Ok, and open up the data flow. Create a new flat file connection, and pick one of your files. After you’ve created the connection manager, right click on it and go to the expressions property. We need to change the connection string to dynamically adjust based on what the flat file is that’s coming through the loop. Click on Expressions, select “Connection string”, and input the following into the expression window:

@[User::strFilePath] + "\\" + @[User::strFileName] + ".csv"

Do the rest of the ETL'ing that you want to do on the file, and when you've finished, add a multicast and point it to the destination and to a row count component. The row count component is important so we can track how many rows SSIS moves with each file and write that back to the table. My data flow looks like the below:

image

Go back to the control flow, and let's add another SQL task to take care of the update that marks the file as finished. Go to your Variables window, and let's configure an expression for that update statement. The variable is our strUpdateWorkPileTableEnd. Go to its properties, and change the EvaluateAsExpression property to True. In the expression window, we'll use the following expression:

"UPDATE [dbo].[SSIS_WorkPile]
SET EndTime = GETDATE()
    ,IsCompleted = 1
      ,LastRunNumberofRecords = " + (DT_WSTR, 50) @[User::intNumberOfRecods]
+ " WHERE FileName = '" +  @[User::strFileName] + "'"
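At runtime, the evaluated expression produces a statement along these lines (the file name and row count here are just illustrative):

UPDATE [dbo].[SSIS_WorkPile]
SET EndTime = GETDATE()
    ,IsCompleted = 1
    ,LastRunNumberofRecords = 52147
WHERE FileName = 'L_AIRLINE_ID'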

Open up the SQL task, and configure your connection manager. Change the SQL Source Type to variable, and select your strUpdateWorkPileTableEnd from the list. The completed configured SQL task will look like the below:

image  

It’s all downhill from here!  Our completed control flow looks like this:

image

That's it! Running these 4 files on my SSD here, they load in about .06 of a second. Wow! With my original scenario, I can create a master package that just calls multiple Execute Package tasks running this package (or copies of it), and scale this out. My colleague was able to meet his SLA and get an 800% improvement over the original process: 7,000 files in less than 10 minutes! A sample master package is included with the solution download I have below. However, it is very important to remember that in SSIS you should typically never be running more than one package per core at a time. Keep this in mind when using parallelism! Also, if you are going to create multiple copies of the package, add delays so the subsequent packages kick off a few seconds after the first one; otherwise, your packages will just block each other and spin forever.
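One simple way to stagger the copies (my suggestion, not part of the original solution) is an Execute SQL task at the start of each copy with a small, increasing delay:

-- Copy 2 of the package might wait 5 seconds, copy 3 ten seconds, and so on
WAITFOR DELAY '00:00:05';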

The complete solution and database backup is available for download here.