DQS: Consuming Azure Data Sets in a Domain

In my opinion, one of the coolest features in DQS is the ability to consume reference data from an Azure data set. Not surprisingly, one of the biggest providers of data sets for DQS on Azure is MelissaData, which has been in the data quality arena for years and has some awesome features that you can purchase. Even better, through the Azure DataMarket, some of the bigger, more common data quality services can be tried and demoed for free. There are DQS APIs in the DataMarket for US and Canadian addresses, phone numbers, IP addresses, name check, and email check. The big question is: when creating a domain in a knowledge base, how can we consume some of these to use against our data set?

In my last blog post, I created a knowledge base with both single and composite domains, and I'll use that knowledge base in this post to connect it to Azure. To start, let's open up our DQS client, connect it to our local instance, and go into Configuration on the Administration tab:

 

[Screenshot: DQS client – Administration tab, Configuration option]

On the configuration page, enter your DataMarket account ID at the top where prompted. (Your account ID can be found on your Azure DataMarket account page; it is the PrimaryAccountKey that gets validated.)

[Screenshot: Configuration page with the DataMarket Account ID field filled in]
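DQS validates the key for you when you enter it, but if you ever want to sanity-check a key outside of DQS, DataMarket services accept HTTP Basic authentication with the account key as the password. Here's a minimal sketch of that check; the service URL is a placeholder, not a real endpoint, so substitute a service you've actually subscribed to:

```python
import base64
import urllib.error
import urllib.request

# Hedged sketch: DataMarket uses HTTP Basic auth with the account key
# as the password. The URL below is a placeholder, not a real endpoint.
ACCOUNT_KEY = "<your PrimaryAccountKey>"
SERVICE_URL = "https://api.datamarket.azure.com/SomeProvider/SomeService/"  # placeholder

# Basic auth header: empty username, account key as password.
credentials = base64.b64encode(f":{ACCOUNT_KEY}".encode()).decode()
request = urllib.request.Request(SERVICE_URL)
request.add_header("Authorization", f"Basic {credentials}")

try:
    with urllib.request.urlopen(request) as response:
        print("Key accepted, HTTP status:", response.status)
except urllib.error.HTTPError as err:
    print("Request rejected, HTTP status:", err.code)
```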

 

Step 1 done. Now let's go back to our knowledge base with our composite geography data. I've subscribed to a MelissaData reference data set on geography, available in the Azure DataMarket here. Click on the Geography composite domain, and then click the Reference Data tab:

 

[Screenshot: Geography composite domain – Reference Data tab]

 

Click Browse at the top right, and then select the data set you want. I'll be selecting Geography:

 

[Screenshot: Browse dialog listing the subscribed reference data sets]

 

Now we map each domain to the field we want to check against in the reference data set: Address to address, City to city, and so on. To add more mappings, click the add icon at the top right of the table. Once done, click OK, and the provider settings page will appear:

[Screenshot: reference data provider settings page]
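Conceptually, the mapping we just built is nothing more than a column-to-field lookup that DQS applies when it ships records to the provider. A rough sketch of the idea, with field names that are purely illustrative (the real provider defines its own schema):

```python
# Illustrative only: the domain-to-reference-field mapping as a lookup table.
# The field names on the right are made up, not the provider's actual schema.
domain_to_rds_field = {
    "Address": "address",
    "City": "city",
    "State": "state",
    "Zip": "zip",
}

row = {"Address": "1 Microsoft Way", "City": "Redmond", "State": "WA", "Zip": "98052"}
payload = {field: row[domain] for domain, field in domain_to_rds_field.items()}
print(payload)  # roughly what gets sent to the reference data service per record
```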

We'll leave the auto-correct threshold at its default for now. Click Finish to publish the knowledge base.
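That threshold controls how confident a reference-data match has to be before DQS corrects a value automatically rather than just suggesting it for review. A minimal sketch of the gating logic; the numbers and the suggestion floor are assumptions for illustration, not DQS internals:

```python
# Illustrative sketch of how an auto-correct threshold gates matches.
# All numbers here are made up; they are not DQS defaults.
AUTO_CORRECT_THRESHOLD = 0.80  # confidence needed to correct automatically
SUGGESTION_FLOOR = 0.60        # hypothetical floor below which matches are ignored

def apply_reference_match(original, candidate, confidence):
    """Decide what to do with one reference-data match for one value."""
    if confidence >= AUTO_CORRECT_THRESHOLD:
        return ("Corrected", candidate)   # high confidence: auto-correct
    if confidence >= SUGGESTION_FLOOR:
        return ("Suggested", candidate)   # surface for a data steward to review
    return ("New", original)              # no usable match; leave the value alone

print(apply_reference_match("Redmund", "Redmond", 0.92))       # ('Corrected', 'Redmond')
print(apply_reference_match("Sprngfeld", "Springfield", 0.70)) # ('Suggested', 'Springfield')
```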

You now have two options: you can either a) run a data quality project against data you already have in your database, or b) use SSIS to check rows as they come in. Remember that in SSIS/DQS implementations all of the work is done on the DQS server, and that processing is single-threaded (a performance hit to begin with); when you use Azure reference data, you go one step further out to a web service. A complete lifecycle picture of it would look something like this:

 

[Diagram: end-to-end flow – SSIS data flow → DQS server → Azure DataMarket reference data service]
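To put some rough numbers on that warning: every extra hop adds latency per call, so throughput depends heavily on how many rows go into each web-service call. A back-of-the-envelope sketch, where every figure is an assumption rather than a measurement:

```python
# Back-of-the-envelope sketch of why the extra hop matters.
# Every number here is an assumption, not a measurement.
rows = 100_000
local_lookup_ms = 0.5    # assumed per-row cost of a local DQS domain check
web_roundtrip_ms = 80.0  # assumed round trip for one Azure service call
batch_size = 1_000       # assumed rows per call, if the provider accepts batches

per_row_hours = rows * web_roundtrip_ms / 1000 / 3600
batched_minutes = (rows / batch_size) * web_roundtrip_ms / 1000 / 60
local_minutes = rows * local_lookup_ms / 1000 / 60

print(f"one web call per row: ~{per_row_hours:.1f} hours")
print(f"batched web calls:    ~{batched_minutes:.2f} minutes")
print(f"local lookups only:   ~{local_minutes:.1f} minutes")
```

The point of the sketch is simply that per-row round trips to a web service dominate everything else, which is why the performance hit compounds when reference data lives out in Azure.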

 

If you want to see how to configure SSIS, see my earlier blog post here. For details on how to create a data quality project, Chris Price has an excellent piece introducing it here.
