SSAS: DistinctCount

As I’m sure all experienced SSAS developers have noticed, when you create a distinct count measure in SSAS multidimensional, you are required to put it in its own measure group so that Analysis Services can handle the calculation separately. Distinct counts are among the most demanding requests that can be placed on a MOLAP database, because the engine must read and compare every underlying value rather than a pre-aggregated total. I was recently reading the Analysis Services Distinct Count Optimization whitepaper by Denny Lee, and wanted to see for myself how changing the partitioning strategy for a distinct count measure group affects performance, from both a processing and a querying point of view. I enlarged the FactInternetSales table in AdventureWorks to ~1 million records, and added my new big internet sales table to the existing Adventure Works multidimensional solution. In full disclosure, the machine I’m working on is an Intel i7 (dual-core) with 16 GB of RAM and a 500 GB SSD. I decided to run three tests: the first with the single default partition, the second with 4 partitions (one for each year), and the third with partitions sliced into equal-sized ranges of the table. Onwards!

To start, I’ve turned Profiler on against my target server and created 2 measure groups against my FactInternetSalesBig table: one calculating the sum of the sales amount, and one performing the distinct count on CustomerKey, representing the distinct number of customers. Below is a screenshot of the partition screen:

image

With one partition, the following query is sent against the relational database during processing, including an ORDER BY on the distinct count column, as we’d expect:

        SELECT [dbo_FactInternetSalesBig].[CustomerKey]  AS [dbo_FactInternetSalesBigCustomerKey0_0],
               [dbo_FactInternetSalesBig].[OrderDateKey] AS [dbo_FactInternetSalesBigOrderDateKey0_1],
               [dbo_FactInternetSalesBig].[DueDateKey]   AS [dbo_FactInternetSalesBigDueDateKey0_2],
               [dbo_FactInternetSalesBig].[ShipDateKey]  AS [dbo_FactInternetSalesBigShipDateKey0_3]
        FROM [dbo].[FactInternetSalesBig] AS [dbo_FactInternetSalesBig]
        ORDER BY [dbo_FactInternetSalesBig].[CustomerKey] ASC

Processing this single partition took 11,622 milliseconds. What happens when we partition by year?

image

This time, Profiler shows us 4 separate queries being fired against the relational database, each with its own ORDER BY clause. The longest-running partition, 2008, took 5,111 milliseconds to process.

image
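For reference, here is a sketch of what each of those per-year queries looks like. The WHERE clause is an assumption on my part: it supposes query-bound partitions filtering on OrderDateKey with AdventureWorks-style YYYYMMDD surrogate date keys, and the column list simply mirrors the single-partition query shown earlier.

```sql
-- Hypothetical binding for the 2005 partition; the date-key bounds are
-- assumptions based on AdventureWorks' YYYYMMDD surrogate date keys.
SELECT [dbo_FactInternetSalesBig].[CustomerKey]  AS [dbo_FactInternetSalesBigCustomerKey0_0],
       [dbo_FactInternetSalesBig].[OrderDateKey] AS [dbo_FactInternetSalesBigOrderDateKey0_1],
       [dbo_FactInternetSalesBig].[DueDateKey]   AS [dbo_FactInternetSalesBigDueDateKey0_2],
       [dbo_FactInternetSalesBig].[ShipDateKey]  AS [dbo_FactInternetSalesBigShipDateKey0_3]
FROM [dbo].[FactInternetSalesBig] AS [dbo_FactInternetSalesBig]
WHERE [dbo_FactInternetSalesBig].[OrderDateKey] BETWEEN 20050101 AND 20051231
ORDER BY [dbo_FactInternetSalesBig].[CustomerKey] ASC
```

The other three partitions would each carry the same shape with their own year's bounds, which is why Profiler shows four separate ORDER BY queries.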

Our issue here is that the partitions Analysis Services is reading from are unequal in size, as illustrated by a graphic similar to the one below:

image

pulled from the SSAS Distinct Count Optimizations Guide, written by Denny Lee

 

Grouping by year causes wide differences between the partition files stored on disk, forcing Analysis Services to read many more records for, say, 2007 than for 2005. We want to take this same data set and split it into equal-sized buckets, making the calculations easier for Analysis Services at query time. Our ideal layout should look like the image below (adapted from the same source as the image above). Using non-intersecting ranges also ensures that Analysis Services only needs to scan each partition once.

image

How can we ensure that the partitions are of equal size? If we go back to the source table behind the fact, we can run a query like the one below, which tells us the min and max of the key we are counting, as well as the width of each key range:

SELECT
    COUNT(DISTINCT f.CustomerKey)                 AS CustomerKeyCount,
    MIN(f.CustomerKey)                            AS MinCustomerKey,
    MAX(f.CustomerKey)                            AS MaxCustomerKey,
    (MAX(f.CustomerKey) - MIN(f.CustomerKey)) / 4 AS CustomerKeySlice
FROM FactInternetSalesBig f

This query yields the following results for my big internet sales table:

image

In the fourth column of the SELECT, I’ve hardcoded the 4, indicating that I want to create 4 partitions. I’m now going to create 4 equal-sized partitions in Analysis Services by defining each range, starting with the minimum value and incrementing by 4,620 for each partition. Doing this takes 4,883 milliseconds to process. The results can be represented by the graphic below:

image
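The boundary arithmetic above can be sketched in a few lines. Note the assumptions: the min/max values of 11000 and 29483 are my guess at the standard AdventureWorks CustomerKey range (they reproduce the 4,620 slice from the query), and the helper name is mine, not anything built into SSAS.

```python
def partition_ranges(min_key, max_key, n_partitions):
    """Split [min_key, max_key] into non-intersecting, roughly equal key ranges."""
    slice_size = (max_key - min_key) // n_partitions
    ranges = []
    lo = min_key
    for i in range(n_partitions):
        # The last range absorbs any remainder so the full key space is covered.
        hi = max_key if i == n_partitions - 1 else lo + slice_size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Assumed AdventureWorks CustomerKey bounds: (29483 - 11000) // 4 == 4620.
for lo, hi in partition_ranges(11000, 29483, 4):
    print(f"WHERE CustomerKey BETWEEN {lo} AND {hi}")
```

Each printed WHERE clause would become the query binding for one partition, giving Analysis Services four non-overlapping slices of the customer key space to scan.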

So, creating 4 equal partitions helps our distinct count queries scale, and gives Analysis Services a solid performance boost when these queries come into play. Thanks, Denny!
