As part of the project I’m currently working on, the development team is very enamored with the cache transform in SSIS. I consider myself pretty experienced with SSIS, and I’ve always stayed away from components that come anywhere close to touching disk, as I believe they slow down package performance. The cache transform takes the result of a query (whatever you’re looking up) and saves it into a .caw file for SSIS to reuse in other packages; it can also serve as an alternative to the in-memory recordset destination. As a use case, let’s say you have a customer dimension that every single fact table you load wants to reference. In the Kimball dimensional world, this would be a conformed dimension. Do you really want to do the lookup against that customer table in every package, or would you rather save the customer records once and have SSIS retrieve them directly from the .caw file? After much debate in the office, being the ever-curious type I am, I decided to test the performance of the cache transform against doing the lookup in every package. Is there a point of diminishing returns?
Looking at the lookup component, we have two choices for the connection type on the first page of the editor:
a) OLE DB connection manager
b) cache connection manager
Most of the time, we just leave it on the OLE DB connection manager, go to the next page, and input our query. However, we can also connect through the cache connection manager. This tells SSIS to go find a cache file (a .caw file, as SSIS calls it internally) and use that instead of re-caching the records from the database in every package that needs them. For my example, I’ve created 3 tables: a customer dimension with 1 million customers, a sales transaction table with 10 million sales in it, and an activity table tracking customer login activity to my website with another 10 million records.
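For reference, the three tables might look something like the sketch below. The exact column lists and names are my shorthand for this post, not verbatim DDL; the only columns that matter for the lookup are the business key on the facts and the key columns on the dimension.

```sql
-- Hypothetical shapes for the three test tables (column lists are assumptions):
CREATE TABLE dbo.Customer (
    customer_key INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key the lookup returns
    customer_id  VARCHAR(20) NOT NULL           -- business key the lookup joins on
    -- ...plus whatever descriptive attributes the dimension carries
);

CREATE TABLE dbo.FactSales (
    CustomerNumber VARCHAR(20) NOT NULL, -- resolved to customer_key during the load
    sale_amount    MONEY NOT NULL
);

CREATE TABLE dbo.FactActivity (
    CustomerNumber VARCHAR(20) NOT NULL,
    login_time     DATETIME NOT NULL
);
```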
The customer dimension is loaded first, and the fact tables are then generated. Upon final completion, I’ve created a master package to create parallelism:
Both of my fact table packages are loaded with an OLE DB connection manager, my control flow is just a straightforward data flow task, and the data flow looks like this:
This is what we’d consider the “classic” approach to fact table loading. We’re pulling in the data from the source, doing the lookup to the customer dimension, and then replacing any null values farther downstream. Looking at the performance measures, the package runs these 10 million rows in 1 minute 32 seconds, which is pretty good. The lookup component in particular interests me: it cached 1,048,540 rows in 0.391 seconds, using 37,747,440 bytes of memory (35.99 MB), or roughly 36 bytes per row. That’s not bad. One thing to note here is that we are only pulling in the lookup columns we actually need, nothing extra. This is very important for space, especially when we consider the memory SSIS is using for the cache.
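To make that last point concrete, the lookup’s source query should be the narrow version, never the whole dimension row. A sketch (column names assumed from my example tables):

```sql
-- Good: only the join column and the surrogate key the fact load needs.
-- This is what keeps the cache around ~36 bytes per row.
SELECT customer_id, customer_key
FROM dbo.Customer;

-- Avoid: caching every dimension attribute balloons the in-memory cache
-- for columns the fact load will never use.
-- SELECT * FROM dbo.Customer;
```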
Ok, so that’s our classic approach. Now let’s try using this cache transform.
On the control flow for the customer package, I’m going to add another data flow. In it, I’ll take a source component and fill it with the query the fact table loads were using for their lookup, which in this case is just SELECT a,b FROM Customer. Directly underneath it I’ll add a cache transform and configure a new cache connection by going to New… and checking the Use File cache option. We’ll give the file a name and click on Columns to do the mapping. We also want an index on the data, so we’ll set the index position for customer_id to 1, as that’s what we’re going to be joining on later. Once this is complete, my control flow looks like the below:
And my data flow for the Build cache is just a simple source and cache destination:
Alright, so now we go back to our data flows for sales and activity and change the “LKP Customers” lookup component to use the cache connection manager instead of the OLE DB connection manager. If you miss it, it’s on the first page under General –> Connection Type. On the Connection page, select the location where you stored the customer cache file. Click on Columns and make sure we’re mapping from CustomerNumber to customer_id. This should return our customer key.
So our new package doesn’t really look any different at the data flow level. But internally, we’ve told SSIS: “don’t go cache those customers from the database every time we need to do this lookup, I’ve put them in a cache file for you. It’s on disk at this location.” Each package that runs the lookup still caches the data in memory; it just loads it from the cache file you’ve built instead of querying the database.
Running each package individually, and then a final run of the master package results in the following table of results:
| All times in seconds | Customer | Sales | Activity | Master package |
| --- | --- | --- | --- | --- |
It’s an interesting observation to me that overall, the cache transform doesn’t appear to give us any performance gain. When running packages individually, it seems to add a few seconds to each run in order to retrieve the cached results from disk. When running packages in parallel, you appear to gain a few seconds on the actual lookup packages, but you may lose that time again in building the cache. This leads me to the conclusion that it’s better to stick with the typical design pattern and point the lookup at your actual table.

The reason I believe people would be tempted to use this in the first place is to prevent the occasional blocking while lookup queries are running. This is the third approach I’ve seen to solving that issue, the first being the NOLOCK query hint and the second being snapshot isolation. Use NOLOCK at your own risk. My personal preference after this is going to be continuing to use snapshot isolation for my lookup queries; it just doesn’t seem that you’re really gaining anything for the additional development the cache transform requires. An additional risk with building cache files is that you now have potentially sensitive data stored on disk, which opens up an entirely new area that you have to secure, protect, etc.
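For comparison, the two query-side alternatives I mentioned look roughly like this in T-SQL (the database and table names here are placeholders from my example):

```sql
-- Option 1: NOLOCK hint. Use at your own risk: it allows dirty reads
-- and can skip or double-read rows during concurrent modifications.
SELECT customer_id, customer_key
FROM dbo.Customer WITH (NOLOCK);

-- Option 2: snapshot isolation. It must first be enabled at the
-- database level (one-time setup):
ALTER DATABASE MyDW SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Then the lookup's session reads a consistent snapshot
-- without blocking, or being blocked by, writers:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
SELECT customer_id, customer_key
FROM dbo.Customer;
```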
So my results after everything are essentially inconclusive. I’m interested in hearing from you, the readers, about additional real-world scenarios and use cases for this component.