SSIS – Massive Scaling & Parallelism

If you have a very large set of data or lots of files to move every night, say thousands of flat files or terabytes upon terabytes of data, a single SSIS server churning away most likely isn’t going to give you the performance you need. The solution is to partition that workload across multiple servers or packages so that all of the data can be processed within the given SLA. The other day I received an email from a colleague about archiving around 7,000 flat files nightly. In the sensor data world, this isn’t an uncommon scenario. The current process was taking around 80 minutes, or about 3 files every 2 seconds. Simple math breaks this down for us:

7,000 files in 80 minutes works out to 7,000 files / 4,800 seconds ≈ 1.5 files per second, or about 3 files every 2 seconds.

The question was: how can we speed this up? Let’s come up with an SLA of, say, 10 minutes. So the challenge is that we want to move 7,000 files inside of 10 minutes to keep our SLA acceptable to the client. To achieve this, we can use a workpile pattern. The workpile pattern allows us to have SSIS read from a designated “controller table” and check files in and out as it runs through the list. To demonstrate, let’s use 4 flat files that are identical in structure:

[Image: the CSV file layout and the 4 source files]

We can see from the above image that I have a CSV file that consists of a code and a description column. On the right are the 4 files I wish to process.

The first thing we need to do is create the data flow portion of the package. A classic approach to this would be a straightforward Foreach Loop container with our data flow inside (please note this blog post does not discuss the initial configuration of the Foreach Loop container):

[Image: Foreach Loop container with the Data Flow task inside]

Our main challenge is that while this works great for our small sample set, real-life production may contain hundreds or thousands of files. The SLA in our scenario is to move 7,000 files, of anywhere from 1 MB to 1,000 MB each, into our database nightly. By adding a controller table, we can track each file and control its priority:

CREATE TABLE [dbo].[SSIS_WorkPile](
    [WorkPileKey] [int] IDENTITY(1,1) NOT NULL,
    [Filename] [varchar](100) NULL,
    [IsStarted] [tinyint] NULL,
    [IsCompleted] [tinyint] NULL,
    [Priority] [int] NULL,
    [StartTime] [datetime] NULL,
    [EndTime] [datetime] NULL,
    [LastRunNumberOfRecords] [int] NULL
)
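
One optional addition of my own, not part of the original solution: since the dequeue logic we write later hunts for the lowest-priority row where IsStarted = 0, a small supporting index keeps that lookup cheap once the table holds thousands of rows.

-- Assumption: my own addition, not in the original post. Supports the
-- "find the next unworked file" lookup used by the stored procedure below.
CREATE NONCLUSTERED INDEX IX_SSIS_WorkPile_Dequeue
    ON dbo.SSIS_WorkPile (IsStarted, [Priority])
    INCLUDE ([Filename]);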

After that, let’s insert our file names into the table:

INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRLINE_ID',0,0,1,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT',0,0,2,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT_ID',0,0,3,NULL,NULL,NULL)
GO
INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
VALUES ('L_AIRPORT_SEQ_ID',0,0,4,NULL,NULL,NULL)
GO
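
As a side note, on SQL Server 2008 and later the same load can be done in one statement with a table value constructor; an equivalent sketch:

INSERT INTO [SSIS_WorkPile].[dbo].[SSIS_WorkPile]
    ([Filename], [IsStarted], [IsCompleted], [Priority], [StartTime], [EndTime], [LastRunNumberOfRecords])
VALUES ('L_AIRLINE_ID', 0, 0, 1, NULL, NULL, NULL)
     , ('L_AIRPORT', 0, 0, 2, NULL, NULL, NULL)
     , ('L_AIRPORT_ID', 0, 0, 3, NULL, NULL, NULL)
     , ('L_AIRPORT_SEQ_ID', 0, 0, 4, NULL, NULL, NULL)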

Notice that I’m inserting just the file name itself, without the full path or the extension on the end. In this example, I’m supplying the file path as a variable and hard-coding the .csv extension on the end. If your scenario is different, you could modify this slightly and add another column for the file extension or the file path. I’ve also set Priority starting at 1 for the first file, and IsStarted and IsCompleted to 0. I’ve left the rest of the fields NULL.

Now back to our package. First, let’s take out our Foreach Loop container and replace it with a For Loop container. Why is this important? If we think about the components for a second, the Foreach Loop container is designed to loop over a specific set of objects within a location that we specify. For Loop containers can loop over, well, anything. If we change our mindset and remember that we’re no longer looping over objects, just over a list of rows in a SQL table, it makes sense. With that said, let’s add a For Loop container to our package and move our objects out of the Foreach Loop container and into it:

[Image: For Loop container with the three tasks inside]

We have 3 components inside of our For Loop: find the file, process the file, and mark it done. Apart from the For Loop container, this isn’t much different from the process we might use for an incremental load design (find last load time, do some work, mark new load time :) ). Let’s configure our variables and the For Loop container first. We have 7 variables here:

[Image: the 7 package variables]

Configure strFilePath to hold the folder path only, without the trailing backslash (the connection string expression we build later adds the separator). Don’t worry about configuring the rest of them right now; it’s not important yet. Create an Execute SQL task before the For Loop container and call it “SQL – Get Number of Files”. We are going to use this to identify how many total files need to be moved, so the For Loop container knows when to stop and doesn’t run forever. Open up the component, configure your connection, and change the ResultSet to “Single row”. Under the SQLStatement, we are going to select the total number of rows in our controller table:

SELECT COUNT(*) FROM SSIS_WorkPile.DBO.SSIS_WorkPile
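
One hedged tweak you might consider (my assumption, not part of the original post): if a nightly run can fail partway through and be restarted, count only the rows that haven’t completed yet, so the loop doesn’t iterate over files that already finished:

-- Variation (my own addition): ignore rows already marked complete on a rerun.
SELECT COUNT(*) FROM SSIS_WorkPile.DBO.SSIS_WorkPile WHERE IsCompleted = 0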

Move over to the result set tab, and configure the result set mapping. Change the ResultName to 0, and select User::intNumberOfFilesToWork as the variable we want to map our result to. A completed screenshot of our configuration is below:

[Images: the General and Result Set tabs of the configured Execute SQL task]

Click Ok. Now open the For Loop container, and configure the InitExpression and EvalExpression properties to evaluate off of the intTaskNumber variable:

[Image: For Loop container InitExpression and EvalExpression settings]

Click Ok, and move over to SQL Server Management Studio. We need to create a SQL statement that does 2 things: first, get the next unworked file, and second, mark that we’ve started it in order to prevent any of our other worker-bee packages from picking it up. SSIS has *ahem* difficulty accepting multiple values back from a plain SQL statement in the Execute SQL task, but wrapping it in a proc lets us return multiple values as outputs. So we wrap it in a stored procedure and generate two outputs: one for the key of the next file we want to work, and one for the actual file name of the file we want to get. Our stored procedure script is:

CREATE PROCEDURE [dbo].[usp_GetTask]
@intTaskNumber INT = 0 OUTPUT
, @strFileName NVARCHAR(255) = N'' OUTPUT
AS
BEGIN

      UPDATE TOP (1) SSIS_WorkPile
      SET @intTaskNumber = WorkPileKey
            , @strFileName = [FileName]
            , StartTime = GETDATE()
            , IsStarted = 1
      WHERE IsStarted = 0
            AND Priority = (SELECT MIN(Priority) FROM SSIS_WorkPile WHERE IsStarted = 0)

      IF @intTaskNumber IS NULL BEGIN
            SET @intTaskNumber = 1
            SET @strFileName = N''
      END
END

GO
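
One caveat worth calling out: when several copies of this package call usp_GetTask at the same moment, two workers can land on the same row and block each other. A common refinement, shown here as my own sketch rather than part of the original solution, is the classic queue-dequeue pattern with ROWLOCK, UPDLOCK, and READPAST hints, so each worker skips rows another worker has already locked:

-- Hypothetical concurrency-friendly variant (my addition, not in the original post).
-- READPAST skips rows locked by other workers; UPDLOCK holds the row we claim.
CREATE PROCEDURE [dbo].[usp_GetTaskConcurrent]
@intTaskNumber INT = 0 OUTPUT
, @strFileName NVARCHAR(255) = N'' OUTPUT
AS
BEGIN

      ;WITH NextTask AS (
            SELECT TOP (1) WorkPileKey, [Filename], StartTime, IsStarted
            FROM dbo.SSIS_WorkPile WITH (ROWLOCK, UPDLOCK, READPAST)
            WHERE IsStarted = 0
            ORDER BY [Priority]
      )
      UPDATE NextTask
      SET @intTaskNumber = WorkPileKey
            , @strFileName = [Filename]
            , StartTime = GETDATE()
            , IsStarted = 1

      -- @@ROWCOUNT = 0 means every file is already claimed or completed.
      IF @@ROWCOUNT = 0 BEGIN
            SET @intTaskNumber = 1
            SET @strFileName = N''
      END
END

GO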

Go back to SSIS, and open up the first Execute SQL task. Set the result set to single row, configure your connection manager, and under the SQLStatement let’s call our stored procedure:

EXEC usp_GetTask ? OUTPUT, ? OUTPUT

Map your parameter return values: User::intTaskNumber to parameter 0 and User::strFileName to parameter 1, both with the Direction set to Output:

[Image: parameter mapping for the two output parameters]

Drop in a Data Flow task, and connect the Execute SQL task that you configured in the last step to it. We need to do 2 things here. The first is to prevent the data flow from kicking off if we get a blank file name back from the table. Right-click on the precedence constraint and select Edit. Change the Evaluation operation to “Expression and Constraint”, and in the expression enter:

@[User::strFileName] != ""

Click Ok, and open up the data flow. Create a new flat file connection, and pick one of your files. After you’ve created the connection manager, right-click on it and go to the Expressions property. We need the connection string to adjust dynamically based on which flat file is coming through the loop (you may also want to set DelayValidation to True on this connection manager so the package doesn’t fail validation before the expression is evaluated). Click on Expressions, select “ConnectionString”, and input the following into the expression window:

@[User::strFilePath] + "\\" + @[User::strFileName] + ".csv"
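
If you can’t guarantee whether strFilePath ends in a backslash, a defensive variation of the same expression (my own addition) uses the SSIS conditional operator to add the separator only when it’s missing:

@[User::strFilePath] + (RIGHT(@[User::strFilePath], 1) == "\\" ? "" : "\\") + @[User::strFileName] + ".csv"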

Do the rest of the ETL’ing that you want to do on the file, then add a Multicast and point it at the destination and at a Row Count component. The Row Count component is important so we can track how many rows SSIS moves with each file and record it in the controller table. My data flow looks like the below:

[Image: the completed data flow]

Go back to the control flow, and let’s add another Execute SQL task to take care of the update that marks the file as finished. Go to your Variables window, and let’s configure an expression for that update statement. The variable is strUpdateWorkPileTableEnd. Go to Properties, and change the EvaluateAsExpression property to True. In the expression window, we’ll use the following expression:

"UPDATE [dbo].[SSIS_WorkPile]
SET EndTime = GETDATE()
    , IsCompleted = 1
    , LastRunNumberOfRecords = " + (DT_WSTR, 50) @[User::intNumberOfRecords]
+ " WHERE [Filename] = '" + @[User::strFileName] + "'"
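
An alternative worth considering here, and to be clear this is my own sketch rather than what the original solution ships with: skip the expression-built string entirely and use a parameterized statement directly in the Execute SQL task, which sidesteps any quoting trouble if a file name ever contains an apostrophe. With an OLE DB connection, the statement would be the below, with User::intNumberOfRecords mapped to parameter 0 and User::strFileName mapped to parameter 1:

UPDATE [dbo].[SSIS_WorkPile]
SET EndTime = GETDATE()
    , IsCompleted = 1
    , LastRunNumberOfRecords = ?
WHERE [Filename] = ?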

Open up the SQL task, and configure your connection manager. Change the SQLSourceType to Variable, and select strUpdateWorkPileTableEnd from the list. The completed SQL task will look like the below:

[Image: the configured Execute SQL task]

It’s all downhill from here!  Our completed control flow looks like this:

[Image: the completed control flow]

That’s it! Running these 4 files on my SSD here, they load in about 0.06 seconds. Wow! For my original scenario, I can create a master package that just runs multiple Execute Package tasks calling this package (or copies of this package) and scale out. My colleague was able to meet his SLA and got an 8x improvement over the original process: 7,000 files in less than 10 minutes! A sample master package is included with the solution download I have below. However, it is very important to remember that in SSIS you should typically never run more than one package per core at a time. Keep this in mind when using parallelism! Also, if you are going to create multiple copies of the package, add delays so that the subsequent packages kick off a few seconds after the first one. If you don’t, your packages will just block each other and spin forever.
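
For the stagger itself, one simple approach (a sketch of mine, assuming an Execute SQL task placed before the For Loop in each worker copy) is a WAITFOR DELAY with a different offset per package:

-- Hypothetical stagger for the second worker package; the third might wait
-- 10 seconds, and so on, so the workers don't all hit usp_GetTask at once.
WAITFOR DELAY '00:00:05';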

The complete solution and database backup is available for download here.
