SSIS: Experimenting with the target cache lookup

As a part of the project I’m currently working on, the development team is very enamored with the cache transform in SSIS. I consider myself pretty experienced with SSIS, and I’ve always stayed away from components that come anywhere close to touching disk, as I believe it slows down package performance. The cache transform takes the result of a query (the rows you want to look up against) and saves it into a .caw file for SSIS to use in other locations. It’s also an alternative to landing transform results in an in-memory recordset destination. As a use case, let’s say you have a customer dimension that every single fact table you load wants to reference. In the Kimball dimensional world, these would be your conformed dimensions. Do you really want to do the lookup against that customer table in every package, or would you rather save the customer records once and have SSIS retrieve them directly from the .caw file? After much debate in the office, being the ever curious type I am, I decided to test the performance of the cache transform against doing the lookup in every package. Is there a point of diminishing returns?

Looking at the lookup component, we have two choices on the first page when we open the editor:

a) OLE DB connection manager

b) cache connection manager

Most of the time, we just leave it on the OLE DB connection manager, go to the next page, and input our command. However, we can also connect with the cache connection manager. This tells SSIS to go find a cache file (a .caw file) and use that instead of re-caching the records from the database in every package that needs them. For my example, I’ve created 3 tables: a customer dimension table with 1 million customers, a sales transaction table with 10 million sales in it, and an activity table tracking customer login activity on my website, with another 10 million records.

The customer dimension is loaded first, and the fact tables are then generated. With the individual packages built, I’ve created a master package to run them in parallel:

image

Both of my fact table packages use an OLE DB connection manager, my control flow is just a straightforward data flow task, and the data flow looks like this:

image

 

This is what we’d consider the “classic” approach to fact table loading. We’re pulling in the data from the source, doing the lookup against the customer dimension, and then replacing any null values farther downstream. Looking at the performance measures, the package runs these 10 million rows in 1 minute 32 seconds, which is pretty good. I’m particularly interested in the lookup component: it cached 1,048,540 rows and processed them in 0.391 seconds, using 37,747,440 bytes of memory (35.99 MB). That’s not bad. One thing to note here is that we are only pulling in the lookup columns we need, nothing extra. This is very important for space, especially when we consider the memory SSIS is trying to use here.
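For illustration, a narrow lookup query against the demo customer table looks something like the sketch below (customer_key is a hypothetical name for the surrogate key column being returned):

```sql
-- Lookup source: only the column we join on (customer_id) and the surrogate key we
-- return downstream. Pulling extra columns just inflates the lookup cache in memory.
SELECT customer_id, customer_key
FROM dbo.Customer;
```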

Ok, so that’s our classic approach. Now let’s try using this cache transform.

On the control flow for the customer package, I’m going to add another data flow, and in it I’m going to drop a source component and fill it with the query the fact table loads were using for the lookup. In this case that’s just SELECT a, b FROM Customer. Directly underneath it I’m going to add a cache transform and configure a new cache connection manager by going to New… and checking the Use file cache option. We’ll give the file a name, and click on Columns to do the mapping. We also want an index on the data, so we’ll set the index position for customer_id to 1, as that’s the column we’ll be joining on later. Once this is complete, my control flow will look like the below:

image

And my data flow for the Build cache is just a simple source and cache destination:

image

Alright, so now we go back to our data flows for sales and activity, and change the “LKP Customers” lookup component to the cache connection type instead of the OLE DB connection type. If you miss it, it’s on the first page under General > Connection type. On the Connection page, select the location where you stored the customer cache. Click on Columns, and make sure we’re mapping CustomerNumber to customer_id. This should return our customer key.

So our new-look package doesn’t really look any different at the data flow level. But internally, we’ve told SSIS: “don’t go cache those customers every time we need to do this lookup; I’ve put them in a cache file for you, and it’s on disk at this location.” Each package that runs the lookup still caches the data in memory; it just loads it from the cache file you’ve built instead of querying the database.

Running each package individually, and then doing a final run of the master package, produces the following results:

All times in seconds    Customer    Sales    Activity    Master package
With cache                   6.7     94.3        87.5             137.2
Without cache                3.4     92.8        85.3             141.4

It’s an interesting observation to me that overall, using the cache transform doesn’t appear to give us any performance gain. When running packages individually, it seems to add a few seconds to each run in order to retrieve the results from disk. When running packages in parallel, it appears that you gain a few seconds of run time in the actual lookup packages, but you may lose that time again in building the cache. This leads me to the conclusion that it’s better to stick with the typical design pattern and point the lookup at your actual table. The reason I believe people would be tempted to use this in the first place is to prevent the occasional blocking while lookup queries are running. This is the third approach I’ve seen to solving that issue (the first being the NOLOCK query hint and the second being snapshot isolation). Use NOLOCK at your own risk. I think my personal preference after this is going to be continuing to use snapshot isolation for my lookup queries. It just doesn’t seem that you’re really gaining anything for the additional development you need to do. An additional risk with building cache files is that you now have potentially sensitive data stored in a cache file on disk. It opens up an entirely new area that you have to secure, protect, etc.
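For anyone curious, here’s a minimal sketch of the snapshot-based alternative, assuming a source database named DW and the demo dbo.Customer table from above; with row versioning enabled, the lookup query stays a plain SELECT with no NOLOCK hint and no cache file:

```sql
-- One-time setup on the source database (DW is a hypothetical name): enable row
-- versioning so lookup reads don't block against concurrent writers.
ALTER DATABASE DW SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE DW SET READ_COMMITTED_SNAPSHOT ON;

-- The lookup source query itself is unchanged: a narrow SELECT, no hints required.
SELECT customer_id, customer_key
FROM dbo.Customer;
```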

 

So my results after everything are essentially inconclusive. I’m interested in hearing from you, the readers: what additional real-world use cases can you think of for this component?

Microsoft Azure and Optimized VM images

One of the features I love about Microsoft Azure and using it for a POC is that many of the initial upfront “battles” you have to face in an on-prem solution are already done for you. I’m going to be honest here: I really don’t like having to explain for the 200th time why databases need to be on separate disks, with log files broken out and tempdb on its own. It’s much easier to just go to Azure, create your machine, and start loading it up with data. It removes much of the complexity and time delay of trying to stand it up yourself. These images have existed in Azure for a while, but they recently got a much needed revamp and are now truly configured to best practices. A high-level list of some of the great benefits of this image:

Instant File Initialization Enabled

15 total disks – 12 data disks in one data pool, 3 log disks in another separate log pool

Data warehousing image only: recovery model set to SIMPLE for the model database using ALTER DATABASE (equivalent to the statement below)
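For reference, that last item boils down to a one-liner you could also run yourself on any instance:

```sql
-- Set the model database to SIMPLE recovery so newly created databases inherit it.
ALTER DATABASE model SET RECOVERY SIMPLE;
```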

 

So, how do you take advantage of this awesome sauce, you ask? Don’t worry, I hear you. The rest of today’s post is going to be all about the initial configuration and stand-up of one of these VMs.

Let’s start by logging in to the Azure portal, selecting Virtual Machines from the category list on the left-hand side, and selecting “Create a virtual machine”.

image

On the New dialog page, select “From Gallery”.

image

Then scroll down to the image you want to use. Here we are going to use the SQL Server 2014 Enterprise image optimized for data warehousing workloads.

image

Then on the next page, we’ll configure the box by giving it a name, selecting the instance size, and creating a new user.

image

This next page is very important, as it’s where we’re going to configure external connectivity to the machine. Select “Create a new cloud service” and give it a name. Select the region and your storage account. For the availability set, select whether you’d like this VM to be part of one. If you’re going to have more than one machine (which in a data warehouse we typically ALWAYS do), this is a very important property. Essentially, availability sets guarantee redundancy in the event of a failure (think of that 99.95% uptime you keep hearing about in Azure marketing); the only way to get that guarantee is to use an availability set. Finally, down at the bottom are your endpoints. These are the connections that can be made into the machine. By default, it’s RDP and PowerShell. I’m going to want to use my local SQL Server Management Studio to connect, so I’m going to add MSSQL to the endpoints list by selecting it from the dropdown.

image

On the last page, select whether you’d like to install any additional configurations on the machine. I’m going to leave mine on the default of just the VM agent, and select Finish.

image

Once you’re done, the VM will begin its provisioning process:

image

Depending on the size and features you selected, this may take some time. Once it’s done, we need to finish our setup so we can connect to our new awesome data warehouse from other SSMS instances. Remote to the machine by clicking Connect and logging in with the username/password combo you provided earlier. Once it’s open, we need to open the firewall to allow connections between our machine and the Azure environment. From the Start menu, open Windows Firewall with Advanced Security (wf.msc) and create a new inbound rule. The rule type is Port, and the port is the one from the endpoint you specified during setup (1433 in this case).

image

Select Next, and then allow the connection. Click Next again and select whether the rule should apply to the domain, private, or public profile. Ideally, yours should be in a domain in your environment. For this post, I’m just going to select Public and click Next. Public opens it up to the internet, though, so you want to be really careful with this setting. Click Next, give it a name and description, and click Finish. Now, while still remoted into the box, open up SQL Server Management Studio and connect to the instance. Go to Properties:

image

and select the Security tab. Change the server authentication mode to SQL Server and Windows Authentication mode (instead of just Windows authentication), and click OK. Right-click on the server again and select Restart to restart the SQL Server service on the machine. Expand Security and create a new SQL login for a user. Now, back on your local machine, open up Management Studio. For the server name, pass in the DNS name of your new Azure VM (myserver.cloudapp.net) and select SQL Server Authentication. Log in with the username/password combo you just created.
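As a hypothetical sketch, the login creation is just standard T-SQL (the name, password, and role below are placeholders; grant the narrowest role that works for your scenario):

```sql
-- Create a SQL-authenticated login to use from your local SSMS.
CREATE LOGIN dw_user WITH PASSWORD = 'UseAStrongPasswordHere!1';

-- Placeholder permission grant; in practice, map the login to specific databases and roles.
ALTER SERVER ROLE sysadmin ADD MEMBER dw_user;
```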

 

Awesome sauce!

SQL Server 2008 R2 SP3 and SQL Server 2008 SP4

Today Microsoft released SP3 for SQL Server 2008 R2 and SP4 for SQL Server 2008. If you’re running either of these platforms, you might want to go grab the installers to keep your environment patched. As these products are both now in extended support, there will be no more CUs released for either of them. However, both of these service packs contain some important security updates. Details around SP4 for 2008 are available here, and details for SP3/2008 R2 are here.

Once you’ve read through them, go install the SP’s! SP4/2008 can be downloaded here and SP3/2008R2 can be downloaded here.

~Chris

Using the new Qlik Sense Desktop

To all of my regular readers, I apologize for the lack of activity on this blog the last few months. With moving, new jobs, and a new city, I haven’t quite found the time to do any writing, but hopefully now as things quiet down I can get back into the groove.

The very first BI tool I used way back in the day was Qlik Desktop. It was an old edition that the company I worked for just had lying around and that the IT department didn’t want to use anymore. After I’d installed it and started getting used to the tool, I realized that I needed to remodel the data I was connecting to into more of a star schema, and thus my journey into data warehousing and BI started. So when I heard last week that Qlik had released their next-gen product, called Qlik Sense, I was eager to go try it out and get a feel for a tool that’s been marketed as a complete makeover of Qlik.

Qlik Sense is Qlik’s new tool for self-service analytics. The really great thing about Qlik is that, unlike some of the other vendors, the full-featured desktop version is free. That’s right, no cash outlay required. Qlik charges for the server version, but if you just want to download and add a feature-rich application to your BI deployment, it’s easy and straightforward to do. That being said, how good is the app? First, let’s go download the tool from Qlik’s site. Once you’ve downloaded it and run the installer, you’ll enter the Qlik Sense “hub”, which is designed to be the central location for your work:

image

Click on the “Create a New app” link, and we’ll give it a name. I’m just going to call mine “MyQlikSenseTest”, and click Create. After the “successful” prompt, click “Open App”. Once you enter into the app, you’re given a choice as to whether you’ll be loading data from a flat file or from a database.

image

As a database professional, I find this…annoying. I need to go to two different places if I’m trying to join data between a flat file and a database. However, from a self-service user perspective this is probably a good feature, as I don’t have to know the ins and outs of where to go in order to get to my data readily and easily. So I think I need to think on this some more…

I’m going to click on “Data Load Editor” as I’ll be bringing data in from the Adventure Works DW database I have installed on the machine. On the main page for the data load editor, over on the right-hand side, we want to add the connection to import the data.

image

Click “Create New Connection” and select the connection type you’d like. Since SQL Server supports OLE DB natively, I’ll be using that for this test.

image

Next you want to input all of the connection string information. For anyone running SQL Server 2012, make sure you select “SQL Server Native Client 11.0” instead of the default “Microsoft OLE DB Provider for SQL Server”. My completed screen looks like the below:

image
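If you’re curious what the dialog builds behind the scenes, the OLE DB connection string looks roughly like the sketch below (the server and database names are placeholders from my setup, and the exact string Qlik generates may differ slightly):

```
Provider=SQLNCLI11;Data Source=localhost;Initial Catalog=AdventureWorksDW;Integrated Security=SSPI;
```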

Click Save, and your connection will appear over on the right hand side of the screen:

image

The middle icon allows you to select the data that you want. If we click it, we get a data exploration tab that can be used to explore and view the data prior to importing it. I’m going to select where the owner is dbo and get a list of all of the tables. I do wish that it would default to that; I can see that trying to explain what dbo is to users, and getting them to click it, will be a pain. I do really like the table filter feature, though; that’s nice.

image

Let’s select FactInternetSales, DimCustomer, DimDate, DimSalesTerritory, DimProduct, DimProductCategory, and DimProductSubCategory, and click “Insert Script”. In the script pane in the middle, find the section loading the FactInternetSales table and find the OrderDateKey column. Change it from OrderDateKey to just DateKey, and in the SELECT statement below it, alias OrderDateKey as DateKey so the load will function (there’s a sketch of this after the next screenshot). One nice thing is that if you click on a table name, you can go through and select just the columns you want to use for your analysis. This is a nice feature, very similar to what you can do in Power Pivot. Up at the top right is a “Load Data” icon; clicking it will import all of the data into the interface. Now we have some data we can work with. Once the load finishes, switch the view by clicking on the compass icon at the top left of the screen and selecting Data Model Viewer. There should be a relationship between the fact table and all of the related tables.

image
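For reference, after that edit the SQL SELECT portion of the generated load script for the fact table ends up looking roughly like this (column list abbreviated, and the preceding LOAD statement’s field list gets the same rename):

```sql
SELECT SalesOrderNumber,
       SalesOrderLineNumber,
       OrderDateKey AS DateKey,  -- renamed so Qlik associates the fact table to DimDate on DateKey
       CustomerKey,
       ProductKey,
       SalesTerritoryKey,
       SalesAmount
FROM dbo.FactInternetSales;
```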

One issue I do have with this page is that you can’t define your own relationships; you have to do it in the load layer. That’s why we had to change “OrderDateKey” to “DateKey” in the load script. This means that if you have a typical role-playing scenario with dates (order date, ship date, due date), you’d have to bring the date table over 3 times and rename its key to match each of those columns in order to get the engine to do the mapping. When you get to year-over-year calculations and other time intelligence functions, this could become very time consuming. I don’t know all of the internals of Qlik, but I do wonder whether this puts unnecessary pressure on the engine’s ability to handle data. Once you’ve confirmed your relationships, click on the compass again and go to the App Overview page. This is where you create your dashboard sheets. Click on “Create new sheet” and give it a title. Click on the icon to the left and you’ll enter the sheet design page. Click the pencil icon up at the top of the page, and you’ll have an empty page you can use to start designing:

image

Starting with a simple bar chart, it’s just a drag and drop from the icon into the main part of the page. After you drag it over, the chart asks you to add a dimension and a measure. We’ll select sales territory country as our dimension and sales amount as the measure, resulting in the following screen:

image

One thing I love about this is that it’s so EASY! It’s so convenient to just click the Add Dimension link and then click the field you want. Adding a line chart, we can apply the same approach, clicking the CalendarYear dimension attribute and selecting SalesAmount again as the measure. My final desktop screenshot looks like the below:

image

When you’re done, click Done up at the top right of the page, where the edit icon used to be. With Qlik, if you want to publish your work you have to have the Qlik Sense server, which requires a license.

In summary, importing data doesn’t seem to have changed much from the old Qlik Desktop in terms of syntax, but it has a nicer, fancier interface, which should appeal to the masses. In my limited review of the product, it looks nice and I can see why Qlik is proud of it. I do like it as a tool and can see where companies could benefit from it. My personal opinion is that it brings Qlik up to speed with the Tableau/Power View contingent of the world, but I didn’t see anything truly groundbreaking or revolutionary in the product; it’s another BI tool in an increasingly saturated market. A major selling point for this product, though, is that the desktop edition is free and you only pay for the server, unlike the Tableaus of the world, which charge you for both the desktop and server versions.

The tides of change

“All great changes are preceded by chaos.” -Deepak Chopra

This quote has always stood out to me, but it bears a special meaning right now. Recently, I’ve accepted a position with an IT consulting company in Columbus, OH. I’m really looking forward to the challenges that await! But this entails a pretty big switch. Some of you may know that just last summer we moved from beautiful Denver, CO to Raleigh, NC. Now, not even a year later, we’re packing it all up and doing it again, this time off to Ohio. We’re really excited about the move, and I’m excited to get into the great SQL community that’s present throughout the entire Ohio valley. Additionally, I’ll be able to be with my kids a bit more, and as my oldest is starting kindergarten in the fall, the ability to be home more is just awesome. I’m also looking forward to going to a lot more user groups around the Ohio area since I’ll be in town during the week, and to getting some time to work on new presentations and refine my old ones. In my new role at Centric Consulting, I’ll be working as an architect and helping steer engagements across their local business units around the country.

 

And, if you live in the Ohio area, I look forward to meeting you at an event soon!

P.S. If posts are a little slow over the next few weeks & months I apologize in advance! We’ll be packing up our house, two kids, and doing all those fun moving things!

 

~Chris

Connecting SSAS Tabular to Excel Files in SharePoint

One of my absolute favorite features in tabular models is the ease with which we can do things like discretization and grouping. What used to take days in Multidimensional can now be accomplished in just minutes. Additionally, we can now give business users full control of these groups, letting them dynamically adjust them in Excel, and then import the results into our tabular models. I did a post a while back about setting up discretization buckets in tabular, but I was recently showing this to a client and the thought occurred to me that we could create a SharePoint library to house these workbooks instead of having to keep them on a file share. Placing them into SharePoint allows us to set security, adjust the workbook in the browser, and just get some other awesome sauce into our model. This should be a pretty short post, but it’s really cool, so bear with me. 🙂

 

First, create a SharePoint library to house your workbook. Truthfully, any library will do, but long term you’re going to want to restrict access to this library to just a few individuals. Once your library is created, upload the workbook. Now select the workbook, and go up to the Library tab in the browser. About halfway across the top of the page, there is an option to “Open in Explorer”.

image

 

When you select that, your workbook will open up in Windows Explorer, and you can then grab the connection string. Right-click on the file and go to Properties. You want to grab the Location value and then add the workbook name to the end of it. Notice that SharePoint files have an extra section in the file path that must be there in order for the pull to work.

image

Copy it and move over to tabular. In the tabular model, create a new connection to an Excel workbook, and then paste in the file path you just copied. Click through and you’re good to go. Now when users want to modify groups, they can do so easily from within SharePoint, and your tabular model can just pull that data through. Awesome-sauce!

Power Pivot: Managing Refresh

Power Pivot workbooks need to be refreshed on a schedule. One of the coolest things about Power Pivot is how seamless it makes all of that gathering and scraping of data from here, there, and everywhere that users used to spend so much time on. So why do all of that work to gather the data only to have it sit static in your model? I don’t know anyone who wakes up every day thinking they just can’t wait to get into the office to click that magic refresh button. Fortunately, once you’ve loaded your Power Pivot workbook into your SharePoint Power Pivot gallery, it’s easy to manage the day-to-day refresh schedule.

 

Start by opening up your Power Pivot workbook. You should see a little calendar icon at the top right of the viewing area:

image

 

If you click on it, it will open up a data refresh management screen:

 

image

First, check the Enable  box. This will turn the refresh capabilities on for that workbook.

Then select how often you want the workbook refreshed: daily, weekly, monthly, or once. (Personally, I’m not sure of the point of the Once option.) We’ll set this workbook to refresh every day: leave it as Daily, and then change the right side to every weekday.

earliest start time tells the workbook when to refresh. If you select After Business Hours, SharePoint will use the business hours set by your admin and determine when to refresh based on availability. To me, this is like telling SSIS “yeah, just anytime you see fit between 9 PM and 6 AM, go for it. You kids have fun!” Uhh…no. Let’s change it to a specific earliest start time, and set a time of day.

email notifications tells the workbook who it should email if the unthinkable happens and the refresh process fails. I usually put my email address in here; you might use yours, the users’, or your boss’s. (I’d caution STRONGLY against that last one :))

credentials allows you to set which user you want to run the refresh process. You might set it to refresh with your credentials, with a specific account already created in the Secure Store Service, or with the account the SharePoint admin already configured for this refresh process.

image

 

Data sources lets you select which data sources you want to refresh. If you have a Power Pivot workbook pointing to multiple locations, you might choose to refresh one at this time of day and the other at a different time of day (on another refresh schedule).

image

Side note: please forgive the fact that my blacked-out server name looks like my 2-year-old did it; I’m doing it with the trackpad on my laptop while sitting in the airport.

 

Once you’re all set and happy with your settings, click OK. You’ll be returned to the Power Pivot gallery with a workbook scheduled for some refreshed awesome-sauce!