With the release of SQL Server 2012, one of the great features that we gain is the ability to manage data quality much easier, and without building something custom. We can do this with a tool called Data Quality Services, and using it is as easy as 1-2-3. In this blog post, I’m going to create a DQS knowledge base to handle a list of states or provinces that Adventure Works does business in. To demonstrate, let’s take a look at the simple spreadsheet I have here showing a list of countries in the world. Notice there is Florida in the list twice, spelled incorrectly and with the country abbreviated:
Pretty straight-forward so far. Now what we want to do is use this list as a master list of all of the states or provinces that we do business in. This way if we get invalid records, rather than marking them as unknown or inferred we can ensure that they are sent to the proper individual (the data steward) to be handled appropriately.
Next, open up the Data Quality client, accessible by going to Start-> All Programs –> SQL Server 2012->Data Quality Services. Connect to your installed DQS instance, and the main page will appear:
Create a new knowledgebase by clicking on the “New Knowledge Base” in the top left hand corner. Give your knowledge base a name, a description, and select what type of knowledge base it should be. There are 3 activities listed, Domain Management, Knowledge discovery, and Matching Policy. Domain management allows you to create a new domain from a blank canvas, knowledge discovery lets you test against an existing dataset, and matching policy lets you either a) identify duplicate records, or b) identify related records.
I’m going to use a domain management knowledge base, and I’m going to call it geography.
Click Next, and the Domain Management window appears. Domains in data quality services are very similar to columns in a relational database, just with a defined list of attributes of what is available. I remember it by thinking SELECT DISTINCT, where the list of rows returned is what your attributes should be.
For now, let’s create one.
Click the “Create a new domain” button and the Create Domain dialog box appears.
We’ll call ours StateProvince, and leave the data type as String. Notice that DQS only offers 4 data types: String, Date, Integer, and Decimal. Additionally, under the “Format Output to”, we can standardize the output of the domain to be all upper case, lower case, or capitalized. For now, leave it as none and click OK. Once we’ve clicked OK, you now get the ability to adjust the rules around the domain with the following domain management screen:
The first tab, Domain Properties, is identical to the tab we just configured. This is useful if after creating the domain, we decide we want to change the output or adjust some of the other properties that we’ve already set. The second tab that we have is a reference data tab. If you have an Azure DataMarket account or want to use an online repository to manage the data set, you can create and configure it in this screen.
The third tab is the Domain Rules tab. In the domain rules tab you set all of the rules that apply to the data set. Before we start creating some rules, though, let’s import our spreadsheet about states in so we can analyze it. The fourth tab, Domain Values, allows us to import a data set that we can use to evaluate as we are creating our rules, to ensure that any rules that we create aren’t a) invalidating every row, or a huge sample of them or b) not affecting any rows, or c) any other length/character analysis. To add our sheet to the Domain Values tab, expand the Import button on the top right of the tab:
Select the workbook and the Sheet name, and select the “Use first row as header” check box:
Once the import is complete, the Import Domain values dialog box appears with a summary of what was set as a synonym, if the Data Quality client detected any. After we click OK, each row is imported in as a valid domain value:
Remember our issue with Florida being spelled incorrectly. Let’s set them to be synonyms and to automatically correct any rows that come in as Folrda to Florida. Find the two values, and select them both by holding the CTRL key and selecting them with your mouse. Right click and select “Set as Synonyms”
Once you set them as synonyms’, DQS maps the two together. Notice that Folrda has now been set to correct any rows to Florida.
Go back to the Domain rules tab, and let’s create a domain rule. Click on the add a Domain rule icon.
Under the Build a Rule section, we can build a rule around different rules that we want our data to adhere to:
Under Name, create a rule called Length, and under the Build A Rule section, select Greater than or Equal to…, and then type in 5. Click Apply all rules and warning will appear:
Click yes, and then click Ok. Then click Finish, and Publish. The knowledge base is created and published. In a future blog post I’ll show how to use SSIS to consume the published knowledge base in an ETL design to help load your data into your data warehouse.