Data Quality Services exists to manage the quality of the data being entered into your data warehouse (or other database). One of the main principles of DQS is the concept of domains, which, if you’ve implemented a master data management strategy before, you would be familiar with. For the rest of us data professionals (me included :)), this is a new concept that we need to understand if we want to use DQS appropriately. There are two main types of domains that we should be familiar with and be able to correctly create and utilize: simple and composite domains. During the rest of this post, I intend to cover the two types in depth, including creating and consuming them.
Single domains are individual by nature and exist alone, these are the most basic domains we can create. To demonstrate, let’s start with Geography data. Addresses always tie to cities, which tie to states, which go to countries. Below is a screenshot of the Geography KB that I’m creating under the Domain Management category:
When we click Next, we’ll then create a domain for Geography by clicking the “Add Domain” at the top left. We’ll create a domain for the main address information, input a description, and make sure the output is proper cased:
Click OK, and go the Domain Values tab. In the real world, we’re going to want to import a lot of this data from somewhere. If we have a list of 100,000 addresses, we’re definitely not going to want to type them all in manually! (Unless you want to super mean to your data analyst…) On the domain values tab, click on the Import button (icon with the arrow pointing left, and select “Import values from Excel”. I’ve also underlined the icon in red in the below screen capture to make it easier to identify:
Import your excel workbook. NOTE: Make sure your workbook is .xls, not .xlsx. DQS does not support xlsx files at this time.
Select the worksheet in the workbook you want to import, and check the “Use first row as header”. Click OK, and DQS will begin importing all of the values:
Once complete, click OK, and return to the Domain Values screen. The list of all the addresses has been imported into our domain for consumption. Once this is published we can consume and check this list to make sure any address being input is correct.
Create a new domain for City, repeating the same steps as above and importing just the City names.
Then do the same thing for State and Country. Once complete, you should have 4 individual domains for each part of the address.
Composite domains give us even more control over our individual domains by allowing us to relate domains together. We can do things like add reference services from Azure (address, for example), create cross-domains rules, and add some more flexibility into our solution.
Once this is complete, at the top left right next to where you click to Create a single domain, click the “Create composite domain” icon next to it:
The composite domain dialog box appears, and we then need to configure our single domains appropriately. In the screenshot below I’ve titled it Geography, and moved over all of the single domains into a new combined composite.
Once you click OK, you then have a composite domain that you can use against the data coming in. By adding a matching policy or knowledge discovery to the domain, we can start inferring and combining data and records that belong together, even though they come from multiple source systems.
The workbook that I used for the source of this data is available here
Stay tuned for more!