Data Warehouse/Mart Loading: An alternate loading approach (HASHBYTES)

When loading dimensions into a warehouse, one of the more time-consuming pieces of development is the comparison between the source data and the destination dimension. Comparing the source against the destination column by column, with all of those OR statements, is expensive for the engine, too. As ETL developers, we can make this process much simpler and easier to develop. By streamlining it, we not only cut our development time, but the result is less error prone and more efficient as well. The “secret sauce” is SQL Server’s HASHBYTES function: we concatenate all of the data points in our source together and generate a hash that represents them, making for a much simpler comparison. By generating the same hash against the dimension, we can then compare one column in the source to one column in the destination. To demonstrate, let’s take the product dimension in the Adventure Works data warehouse and explore it in a little more detail:

This dimension consists of 36 columns, sourced from a couple of tables in the Adventure Works OLTP database, starting with Production.Product and joining in some additional tables. For simplicity’s sake, I am just going to use the Production.Product table. Let’s assume that after assessment we decide that the best ETL pattern for our development is a design using a lookup to the existing dimension, with a conditional split for changes, and then our destinations:

[image: data flow with a source, a lookup to the existing dimension, a conditional split for changes, and the destinations]

 (side note: in an actual production environment, you probably wouldn’t use the OLE DB Command, and would instead replace it with a staging table and an update statement)
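That staging-table variant might be sketched like this (the table and column names below are hypothetical, purely for illustration):

```sql
-- Hypothetical sketch: the data flow lands changed rows in a staging table,
-- then one set-based UPDATE applies them to the dimension.
UPDATE dim
SET    dim.Class       = stg.Class,
       dim.ListPrice   = stg.ListPrice,
       dim.ProductLine = stg.ProductLine
FROM   dbo.DimProduct AS dim
INNER JOIN stg.DimProduct_Updates AS stg
        ON dim.ProductAlternateKey = stg.ProductNumber;

-- Clear the staging table for the next load.
TRUNCATE TABLE stg.DimProduct_Updates;
```

A set-based UPDATE like this is generally much faster than the row-by-row updates the OLE DB Command issues.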

Looking at the conditional split in a little more detail,

[image: the conditional split, with a Changed condition comparing every source column against its _DW counterpart]

We can see two things: one, that we had to output every column from our dimension, appending something like _DW to each column name for the comparison; and two, that condition looks like it was a &!#$ to write! Regrettably, it was. With enough columns, one typo can cost you hours trying to get that one statement working just right. Fortunately, instead of writing that horrendous Changed expression, we have an alternative: the HASHBYTES function.

The HASHBYTES function was first introduced in SQL Server 2005, and it can produce different hashes depending on the algorithm we choose. There are 7 different algorithms available, and the function accepts two arguments: the first is the hash type you want, and the second is the input to hash. Note that the second argument is a single character or binary expression, so to hash multiple columns you concatenate them together. A simple hash would look like:

HASHBYTES('SHA1', Product.[Name]) AS [Hash]

In a real-world scenario, we need to handle NULL values and ensure that our data types are consistent across every column. Generating a hash against a NULL is going to result in… NULL, and different data types will produce different hashes for the same value. To start, create a hash on the source data consisting of the columns you want to compare. For this demo, let’s just create a hash on three columns, but you can expand this out as much as you need. A simple hash on the Product source table would look like this:

CAST(HASHBYTES('SHA1', ISNULL(CAST([Class] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST([ListPrice] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST([ProductLine] AS VARCHAR(30)), 'UNKNOWN')
            ) AS BIGINT) AS [SourceHash]

We need to cast the whole hash as a BIGINT in order for our comparison to work later in SSIS (the cast keeps only 8 of the hash’s 20 bytes, which is still ample for change detection). Now that we have this, we can add it to our source SELECT statement. In Visual Studio, open up the source, and we’ll replace our original query with the original columns plus our hash.
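The two pitfalls mentioned above are easy to verify with a few one-off queries (these are just illustrations, not part of the load):

```sql
-- Hashing a NULL yields NULL, which is why every nullable column
-- gets wrapped in ISNULL:
SELECT HASHBYTES('SHA1', CAST(NULL AS VARCHAR(30)));                    -- NULL
SELECT HASHBYTES('SHA1', ISNULL(CAST(NULL AS VARCHAR(30)), 'UNKNOWN')); -- a real hash

-- The same value as VARCHAR and NVARCHAR hashes to different bytes,
-- so the source and dimension sides must cast to identical types:
SELECT HASHBYTES('SHA1', 'L');    -- VARCHAR input
SELECT HASHBYTES('SHA1', N'L');   -- NVARCHAR input: different bytes
```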

[image: the OLE DB source query with the added SourceHash column]
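Sketched in full, a source query along those lines might look like the following (the non-hash columns are just an illustrative selection from Production.Product):

```sql
-- Source query: the original columns plus the change-detection hash.
SELECT  p.ProductID,
        p.[Name],
        p.ProductNumber,
        p.Class,
        p.ListPrice,
        p.ProductLine,
        CAST(HASHBYTES('SHA1',
                    ISNULL(CAST(p.[Class] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST(p.[ListPrice] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST(p.[ProductLine] AS VARCHAR(30)), 'UNKNOWN')
        ) AS BIGINT) AS [SourceHash]
FROM Production.Product AS p;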

Then, in the Lookup component, add the same function, but make sure that the column names point to your dimension table (or whatever you’re looking up to). Once you’ve got the statement right, add the hash as an additional output along with your surrogate key:

[image: the Lookup component query outputting the surrogate key and DestinationHash]
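The lookup query mirrors the source-side hash, only pointed at the dimension (the table and column names below assume an AdventureWorksDW-style DimProduct, so adjust to your own schema):

```sql
-- Lookup query: surrogate key plus the dimension-side hash.
SELECT  d.ProductKey,
        CAST(HASHBYTES('SHA1',
                    ISNULL(CAST(d.[Class] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST(d.[ListPrice] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST(d.[ProductLine] AS VARCHAR(30)), 'UNKNOWN')
        ) AS BIGINT) AS [DestinationHash]
FROM dbo.DimProduct AS d;
```

The casts, ISNULLs, delimiter, and column order must match the source query exactly, or identical rows will hash differently.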

Remember our conditional split for all of the changed records up above? Change the condition so that it is simply [SourceHash] != [DestinationHash]. Our new conditional split looks like this:

[image: the conditional split with the simplified hash comparison]

And our complete data flow looks exactly the same as before! This is a great way to keep the speed of development up. Note that this works even if you have both type 1 and type 2 attributes in your dimension: you would just generate one hash consisting of only the type 1 attributes and another hash consisting of only the type 2 attributes.
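For a mixed dimension, the source query might be sketched like this (which columns count as type 1 versus type 2 below is purely illustrative):

```sql
SELECT  p.ProductID,
        -- Hash of the type 1 attributes (overwrite in place on change):
        CAST(HASHBYTES('SHA1',
                    ISNULL(CAST(p.[Class] AS VARCHAR(30)), 'UNKNOWN')
            + '|' + ISNULL(CAST(p.[ProductLine] AS VARCHAR(30)), 'UNKNOWN')
        ) AS BIGINT) AS [Type1Hash],
        -- Hash of the type 2 attributes (expire the row, insert a new one):
        CAST(HASHBYTES('SHA1',
                    ISNULL(CAST(p.[ListPrice] AS VARCHAR(30)), 'UNKNOWN')
        ) AS BIGINT) AS [Type2Hash]
FROM Production.Product AS p;
```

The conditional split then gets one condition per hash, routing type 1 changes to the update path and type 2 changes to the expire-and-insert path.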

Below is a table of the algorithms and the different outputs:

Algorithm     Output size
---------     -----------
MD2           16 bytes (128 bits)
MD4           16 bytes (128 bits)
MD5           16 bytes (128 bits)
SHA           20 bytes (160 bits)
SHA1          20 bytes (160 bits)
SHA2_256      32 bytes (256 bits, SQL Server 2012+)
SHA2_512      64 bytes (512 bits, SQL Server 2012+)

If you want more info on hashing from a security perspective in SQL Server, Don Pinto has a good article on it here.
