Tabular Basket Analysis: Simple Data Mining in Tabular databases

Before I get to today’s post, on a personal note I apologize for the lack of activity lately on this blog. I’ve recently moved across the country from Colorado to North Carolina, and between trying to get some things in order and my day to day work responsibilities, I didn’t have enough time in the day. I’m good now, though, so let’s get to it! 🙂

 

I was recently reading Alberto’s amazingly awesome piece on basket analysis in tabular in DAX (available here), but I had some issues implementing it on my own. I wanted to blog about this in case anyone else ran into the same issues I did when trying to implement it. At its core what the question is really asking is “Of everyone that bought Product X, who didn’t buy Product Y”? This can be a very tricky question to answer from a business intelligence perspective, but the value it gives to the business is amazingly useful. Firstly, some background. To keep this simple I am going to be using the same data set Alberto did, on Internet Sales in Adventure Works. The specific question we want to answer is: “Of everyone that purchased a mountain bike, who didn’t buy a tire accessory?” My tabular model that I use for all of my dax stuff is pretty big, but the sub-model snippet we want to initially look at looks like:

 

image 

 

As Alberto highlights, this only answers one part of the question, which is these are the products the customer actually purchased. To answer the 2nd part of the question, which is “who didn’t buy tire tubes”, we want to bring in another instance of the product dimension as a role playing dimension that we can utilize separately. There are now 2 product dimensions in our diagram, one which a relationship to the internet sales and one that has no relationship at all. If you think about it that makes sense too as the product key in the Internet Sales table wouldn’t join to the filtered products dimension as there is no associated product sold. Our new diagram looks like:

 

image

 

Let’s go into the new Product table that we just imported and rename the EnglishProductName field to Analyzed Product:

image

 

Now we can create a formula to return what was bought out of the product we want to analyze. Go to the Internet Sales table, and our formula will look like this:

Number Sales:=CALCULATE (
    COUNTROWS (DISTINCT (‘Internet Sales'[CustomerKey])),
    FILTER (
        ALL (‘Date’),
        ‘Date'[DateKey] <= MAX (‘Date'[DateKey])
    ),
    FILTER (
        ‘Customer’,
        SUMX (
            ‘Filtered Products’,
            CALCULATE (
                COUNTROWS (‘Internet Sales’),
                ALL (‘Internet Sales’),
                ‘Internet Sales'[CustomerKey] = EARLIER (Customer[CustomerKey]),
                ‘Internet Sales'[ProductKey] = EARLIER (‘Filtered Products'[ProductKey]),
                FILTER (
                    ALL (‘Date’),
                    ‘Date'[DateKey] <= MAX (‘Date'[DateKey])
                )
            )
       ) > 0
    )
)

This returns all of the products that were bought for the analyzed product in question. A similar calculation that is going to return the count of products for everything that wasn’t bought looks like this:

Number Non Sales:=CALCULATE (
    COUNTROWS (FILTER(‘Customer’, CALCULATE(COUNTROWS(‘Internet Sales’))=0)),
    FILTER (
        ALL (‘Date’),
        ‘Date'[DateKey] <= MAX (‘Date'[DateKey])
    ),
    FILTER (
        ‘Customer’,
        SUMX (
            ‘Filtered Products’,
            CALCULATE (
                COUNTROWS (‘Internet Sales’),
                ALL (‘Internet Sales’),
                ‘Internet Sales'[CustomerKey] = EARLIER (Customer[CustomerKey]),
                ‘Internet Sales'[ProductKey] = EARLIER (‘Filtered Products'[ProductKey]),
                FILTER (
                    ALL (‘Date’),
                    ‘Date'[DateKey] <= MAX (‘Date'[DateKey])
                )
            )
       ) > 0
    )
)

NOTE: When I initially did this, my Number Non Sales never returned any counts back. It was always blank (0). In my pivot, I had customer gender, and then the number of sales and non sales. After looking at it for a few minutes I realized you wouldn’t be able to use customer attributes in the pivot like that, and you would always have to use the Product table/dimension to browse the measure. Even the filtered products table wouldn’t work, you can only ever use the product table to browse the attribute. Moving to Analyze in Excel, I created a slicer for the product you would want to analyze, and placed Product Category on rows, and dates on columns. The values are the two measures, Number Sales and Number Non Sales. A screen capture is below:

image

 

What’s really nice about this is that you can use it to analyze any product in the table that you would want to analyze. Another way of looking at it is “A brief introduction into data mining in tabular.” If I’m analyzing the Touring 1000- Blue bike, I can tell that 48 out of 61 people have purchased additional accessories. I can then double click to get the list of the 13 customers who haven’t bought any accessories, and then send them an incentive to purchase one and increase my sales, knowing that they are more like than any of my other customers to purchase one.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s