As I’m sure all experienced SSAS developers have noticed, when you create distinct count measures in SSAS multidimensional, you are required to create a separate measure group in order to let Analysis Services identify these calculations easier. DIstinct counts are one of the most demanding requests that can be placed on a MOLAP database, caused by the large number of dimensional attributes that typically exist in a cube. I was recently reading the Analysis Services Distinct Count Optimization whitepaper by Denny Lee, and wanted to see for myself how changing the partitioning strategy for a distinct count measure group could affect performance, both from a processing and querying point of view. I enlarged the FactInternetSales table in AdventureWorks to ~1 million records, and took the existing Adventure Works multidimensional solution and just added my new big internet sales table into it. In full disclosure, the machine I’m working on is an Intel I7, dual-core, with 16 GB of RAM, and a 500GB SSD. I decided to run 3 tests, the first with the initial default partition, the second with 4 partitions, one for each year, and the third one creating equal distributions by the size of the table. Onwards!
To start, I’ve turned profiler on against my target server and created an 2 measure groups against my FactInternetSalesBig table, one calculating the sum of all of the sales amount and one doing the distinct count on my customerkey, representing the distinct number of customers. Below is the screenshot of the partition screen:
One partition results in the following query being sent against the database during processing, requesting an ORDER BY, as we’d expect.
SELECT [dbo_FactInternetSalesBig].[CustomerKey] AS [dbo_FactInternetSalesBigCustomerKey0_0],[dbo_FactInternetSalesBig].[OrderDateKey] AS [dbo_FactInternetSalesBigOrderDateKey0_1],[dbo_FactInternetSalesBig].[DueDateKey] AS [dbo_FactInternetSalesBigDueDateKey0_2],[dbo_FactInternetSalesBig].[ShipDateKey] AS [dbo_FactInternetSalesBigShipDateKey0_3]
FROM [dbo].[FactInternetSalesBig] AS [dbo_FactInternetSalesBig]
ORDER BY [dbo_FactInternetSalesBig].[CustomerKey]
Additionally, this took 11622 milliseconds to process. What happens when we partition by year?
This time, profiler shows us the 4 separate queries being fired off into the relational database, each with their own order by statement. The longest partition, 2008, took 5111 milliseconds to process.
Our issue here is that when Analysis Services is reading from each partition, they are unequal in size, as evidenced by a graphic similar to the below:
pulled from the SSAS Distinct Count Optimizations Guide, written by Denny Lee
Grouping by year causes wide differences between the partitions files stored on disk, causing analysis services to read many more records for say, 2007 than 2005. We want to take this same data set, and group them into equal buckets each, in order to make it easier for Analysis Services to do the calculations at query time. Our ideal image should look like the below (adopted from the same source as the image above). This also ensures that AS only needs to scan each partition once, by creating non-intersecting partitions.
How can we ensure that partitions are of equal size? If we return back to our original table that is generating the fact, we can run a query like the below which tells us the min and max range of the table, as well as how many records should be in each group:
COUNT (DISTINCT f.SalesOrderNumber) as CustomerKeyCount,
MIN(f.CustomerKey) as MinCustomerKey,
MAX(f.CustomerKey) as MaxCustomerKey,
(MAX(f.CustomerKey) – MIN(f.CustomerKey)) / 4 as CustomerKeySlice
FROM FactInternetSalesBig f
This table yields the following results in my big internet sales table:
In the fourth column of the select, I’ve hardcoded the 4, stating that I want to create 4 partitions. I’m now going to create 4 equal sized partitions in AS by creating each range, starting with the min value and incrementing by 4620 for each partition. Doing this takes 4883 milliseconds to process. The results can be represented via the following below graphic:
So, creating 4 equal partitions can help us scale our distinct count queries more and giving AS a really good performance boost when these queries come into play. Thanks Denny!