In my opinion, one of the coolest features in DQS is the ability to consume information from an Azure data set. Not surprisingly, one of the biggest providers of data sets for DQS on Azure is MelissaData, which has been around for years in the data quality arena and has some awesome features that you can purchase. Even better, with the Azure datamarket, some of the bigger, more common data quality pieces can be tried and demo’d for free. There are DQS API’s in the data market for US & Canadian addresses, phone numbers, ip addresses, name check, and Email check. The best part about these The big question is, when creating a domain in a knowledge base, how can we consume some of these to use against our data set?
In my last blog post, I created a knowledge base with both single and composite domains and I’ll use that for this post to connect it to Azure. To start, let’s open up our DQS client, connect it to our local instance, and go into Configuration on the Administration tab:
On the configuration page, enter your datamarket account ID at the top where prompted: (Your account ID can be accessed on the azure datamarket account page, it is your PrimaryAccountKey that gets validated)
Step 1 done. Now let’s go back to our knowledge base with our composite geography data. I’ve subscribed to a MelissaData reference data set on geography, available in the Azure datamarket here. Click on the Geography composite domain, and click on the Reference data tab:
Select browse at the top right, and then click on the data set you want. I’ll be selecting geography:
Now we map each domain into the domain that we want to check in the reference set. Map Address to address, city to city, etc. To add more, click on the add icon at the top right of the table. Once done, click Ok, and the provider settings page will appear:
We’ll leave the auto correct threshold at the default for now. Click Finish to publish the knowledge base.
You now have two options. You can either a) run a data quality project against data that you already have in your database to evaluate, or b) use SSIS to check rows coming in. Remembering that in SSIS/DQS implementations, all of the work is done on the DQS server, and it is single-threaded (causing performance hits to start), when you use Azure data you go another step further out to a web service. A complete lifecycle picture of it would look something like this: