SSIS & DQS: The Knowledgebase does not exist

As IT professionals, we usually follow a dev->test->prod path during development (or at least dev->prod). With the release of Microsoft’s Data Quality Services in SQL Server 2012, it is natural to assume that creating and publishing the rules and domains for our data quality knowledge bases would follow the same path. As Business Intelligence developers, we then consume the knowledge base with an SSIS package during transformation and deploy. Unfortunately, there is a known issue with Data Quality Services where this doesn’t always work. Less than a week after I solved the issue for my client, I received an email from one of our Microsoft partners asking about the same error. I thought I would detail it here in case anyone else runs into it as well. The error only appears on the deployment (target) server, and looks like this:

Data Flow Task Name:Error: Microsoft.Ssdqs.Infra.Exceptions.EntryPointException: The Knowledge Base does not exist [Id : 1000999].
at Microsoft.Ssdqs.Proxy.Database.DBAccessClient.Exec()
at Microsoft.Ssdqs.Proxy.EntryPoint.KnowledgebaseManagementEntryPointClient.DQProjectGetById(Int64 id)
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.PostExecute()
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPostExecute(IDTSManagedComponentWrapper100 wrapper)
Data Flow Task Name:Error: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessChunk(ReadOnlyCollection`1 fieldMappings, ReadOnlyCollection`1 records, CorrectedRecordsStatusStatistics& correctedRecordsTotalStatusStatistics)
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessInput(Int32 inputID, PipelineBuffer buffer)
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostProcessInput(IDTSManagedComponentWrapper100 wrapper, Int32 inputID, IDTSBuffer100 pDTSBuffer, IntPtr bufferWirePacket)

The line we’re concerned with here is the first one: the knowledge base does not exist. It means that SSIS cannot find the corresponding knowledge base in the DQS_MAIN database, which we can verify by looking at the dbo.A_KNOWLEDGEBASE table:

image
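
If you’d rather query it than eyeball the screenshot, something along these lines (run against DQS_MAIN; ID and NAME are the same columns used by the query later in this post) lists the knowledge base IDs that actually exist on the target server:

SELECT [ID]
      ,[NAME]
FROM [dbo].[A_KNOWLEDGEBASE]
ORDER BY [ID] DESC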

The key concept here is that every time you publish a knowledge base, a new ID is created. Going back to our opening statement, we therefore typically would think of the cycle as looking like this:

image

Every time we do a deployment, in addition to deploying the SSIS packages, we need to deploy the DQS knowledge base as well. However, since a new ID is created every time we cross the “deployment line”, SSIS can never stay in sync with the DQS knowledge base ID: dev will naturally have many more releases than production, knocking the knowledge base IDs out of sync across servers. There seem to be several ways around this issue, and some (to me) are better than others:

Option 1: Manually open the package on the destination server in SQL Server Data Tools to force the knowledge base ID to refresh.

NOTE: Since I don’t know many shops that make a habit of opening packages in production just to refresh them, I’m going to disqualify this one and move on.

Option 2: View the XML of the package and manually change the knowledge base ID so that it points to the right knowledge base.

Solution: At least a slightly better alternative than opening the package. We can do this by connecting to our DQS_MAIN database, going back to the A_KNOWLEDGEBASE table we looked at earlier, and running the following query:

SELECT MAX([ID]) AS [KB_ID]
      ,[NAME]
FROM [dbo].[A_KNOWLEDGEBASE]
WHERE [NAME] LIKE '%PutYourKnowledgebaseNameHere%'
GROUP BY [NAME]

This returns the maximum ID for the knowledge base you’re looking for. We can then use that value by editing it inline in the SSIS package XML. Locate the SSIS package that calls the knowledge base in the file system, right click on it, and open it with your favorite XML editor. Once it’s open, look for the tag holding the knowledge base ID and replace its value with the number returned by the query.

Again, though, we have the same issue as in Option 1 where we need to manually edit things in production.

Option 3: This solution makes the most sense to me, but it’s a definite shift from the way we’re used to doing things in IT (and development in general). Remember that, at its core, DQS is designed to be used by an end user. It is, in reality, a tool that a business analyst can (and should) use to create rules around their data, which we as developers then consume. In this case, development and production almost feel as though they should collapse into a single step, with the analyst previewing and analyzing before the initial publish. If there is one knowledge base that everything points to, maintained by the analyst, we don’t have to worry about deploying DQS databases across environments at all. Additionally, DQS is new enough that just teaching someone the basics of what they can do with it is work enough; teaching a business analyst the ins and outs of deployment processes seems unnecessary. Visually, this approach looks like the below.

image

Having one knowledge base maintained by the user lets every ETL process, whether dev, test, production, or anything else, read from the same knowledge base. This prevents the error from occurring: no matter how many times our users update the knowledge base, all SSIS packages will always pull the most recent knowledge base ID, because they are always querying the same A_KNOWLEDGEBASE table in the same DQS_MAIN database.
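
If you ever need to confirm which ID “most recent” resolves to on that shared server, the same table answers it directly; a minimal sketch, run against DQS_MAIN:

SELECT [NAME]
      ,MAX([ID]) AS [CurrentId]   -- latest published ID per knowledge base
FROM [dbo].[A_KNOWLEDGEBASE]
GROUP BY [NAME]
ORDER BY [NAME]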


DQS: Consuming Azure Data Sets in a Domain

In my opinion, one of the coolest features in DQS is the ability to consume information from an Azure data set. Not surprisingly, one of the biggest providers of data sets for DQS on Azure is MelissaData, which has been around for years in the data quality arena and has some awesome features you can purchase. Even better, with the Azure DataMarket, some of the bigger, more common data quality pieces can be tried and demoed for free. There are DQS APIs in the DataMarket for US & Canadian addresses, phone numbers, IP addresses, name check, and email check. The big question is: when creating a domain in a knowledge base, how can we consume some of these to use against our data set?

In my last blog post, I created a knowledge base with both single and composite domains, and I’ll use that one here to connect to Azure. To start, let’s open the DQS client, connect to our local instance, and go into Configuration under the Administration tab:

 

image

On the configuration page, enter your DataMarket account ID at the top where prompted. (Your account ID can be found on the Azure DataMarket account page; it is your PrimaryAccountKey that gets validated.)

image

 

Step 1 done. Now let’s go back to our knowledge base with our composite geography data. I’ve subscribed to a MelissaData reference data set on geography, available in the Azure datamarket here. Click on the Geography composite domain, and click on the Reference data tab:

 

image

 

Select browse at the top right, and then click on the data set you want. I’ll be selecting geography:

 

image

 

Now we map each of our domains to the corresponding field we want to check in the reference data set: map Address to address, City to city, and so on. To add more mappings, click the add icon at the top right of the table. Once done, click OK, and the provider settings page will appear:

image

We’ll leave the auto correct threshold at the default for now. Click Finish to publish the knowledge base. 

You now have two options: a) run a data quality project against data you already have in your database, or b) use SSIS to check rows as they come in. Remember that in SSIS/DQS implementations all of the work is done on the DQS server, and it is single-threaded (a performance hit to begin with); when you use Azure reference data, each check goes one step further out to a web service. A complete lifecycle picture looks something like this:

 

image

 

If you want to see how to configure SSIS, see my earlier blog post here. For details on how to create a data quality project, Chris Price has an excellent piece introducing it here.

DQS: Single vs Composite Domains

Data Quality Services exists to manage the quality of the data being entered into your data warehouse (or other database). One of the main principles of DQS is the concept of domains, which you will already be familiar with if you’ve implemented a master data management strategy before. For the rest of us data professionals (me included :)), this is a new concept that we need to understand if we want to use DQS appropriately. There are two main types of domains that we should be able to correctly create and utilize: single and composite domains. In the rest of this post, I’ll cover both types in depth, including creating and consuming them.

Single domains are individual by nature and exist alone; these are the most basic domains we can create. To demonstrate, let’s start with geography data: addresses tie to cities, which tie to states, which tie to countries. Below is a screenshot of the Geography KB that I’m creating under the Domain Management activity:

image

When we click Next, we’ll create a domain by clicking the “Add Domain” button at the top left. We’ll create a domain for the main address information, enter a description, and make sure the output is proper cased:

 

image

Click OK, and go to the Domain Values tab. In the real world, we’re going to want to import a lot of this data from somewhere; if we have a list of 100,000 addresses, we’re definitely not going to type them all in manually! (Unless you want to be super mean to your data analyst…) On the Domain Values tab, click the Import button (the icon with the arrow pointing left) and select “Import values from Excel”. I’ve underlined the icon in red in the screen capture below to make it easier to identify:

image

Import your Excel workbook. NOTE: Make sure your workbook is saved as .xls, not .xlsx; DQS does not support .xlsx files at this time.

Select the worksheet in the workbook you want to import, and check the “Use first row as header” check box. Click OK, and DQS will begin importing all of the values:

 

image

 

Once complete, click OK and return to the Domain Values screen. The full list of addresses has now been imported into our domain. Once this is published, we can check incoming addresses against this list to make sure they are correct.

Create a new domain for City, repeating the same steps as above and importing just the City names.

Then do the same thing for State and Country. Once complete, you should have four individual domains, one for each part of the address.
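
As an aside, if your source values already live in a database rather than a spreadsheet, pulling the distinct values out to drop into the .xls you import is a one-line query; a rough sketch, where the table and column names are hypothetical:

SELECT DISTINCT [City]
FROM [dbo].[CustomerAddress]   -- hypothetical source table
ORDER BY [City]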

Composite domains give us even more control by allowing us to relate individual domains together. We can do things like attach reference data services from Azure (address verification, for example), create cross-domain rules, and generally add more flexibility to our solution.

Once this is complete, click the “Create a composite domain” icon at the top left, right next to the icon used to create a single domain:

image 

The composite domain dialog box appears, and we then need to configure our single domains appropriately. In the screenshot below I’ve named the composite domain Geography and moved all of the single domains over into it.

image

Once you click OK, you then have a composite domain that you can use against the data coming in. By adding a matching policy or knowledge discovery to the domain, we can start inferring and combining data and records that belong together, even though they come from multiple source systems.

The workbook that I used as the source of this data is available here.

Stay tuned for more!

SSIS: Using the DQS Cleansing component

Beyond creating and maintaining knowledge bases, one of the biggest benefits of Data Quality Services is integrating published knowledge bases into your enterprise data management solution. One of my favorite things about DQS is the control it gives the business over the logic. Once your business user creates their DQS knowledge base, the next step is to actually integrate it. The DQS Cleansing component allows us to bring that integration into our ETL process, and it’s pretty simple to set up in SSIS.

 

To start, I need a data source. For this example, I’m going to use the Person.vStateProvinceCountryRegion view in the AdventureWorks transactional database and run it against a published DQS domain containing state and province information. My initial SSIS package is simply a Data Flow Task with an OLE DB source pointing to that view:

 

image
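
The source itself is nothing fancy; a query along these lines against the AdventureWorks view works, though the exact column list here is my assumption, so trim it to what you actually need:

SELECT StateProvinceCode
      ,StateProvinceName
      ,CountryRegionName
FROM Person.vStateProvinceCountryRegion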

 

Then we’ll drag on the DQS Cleansing component from the SSIS toolbox, found under “Other Transforms”, and open it. The DQS Cleansing Transformation Editor will appear:

 

image

 

Connect to the DQS instance of SQL Server that you have installed by clicking the New button under Data quality connection manager and typing the server name in the pop-up window.

 image

 

Select the data quality knowledge base that you want to connect to. Every published DQS knowledge base is accessible here.

 

image

 

We’ll select DQS Data, and the list of available domains will appear. Remember that DQS domains are column-level, so we’d have a separate domain for each column:

 

image

 

Click on the Mapping tab, and then check the box next to the StateProvinceName field. In the dropdown under the Domain column, select US-State, as we’re mapping the state/province name to the US State domain in DQS.

 

image

 

Click OK on the component, and then create a dead end in your package by dragging on, for example, a Union All or Multicast. Add a data viewer and execute the package. The data viewer will appear and we’ll see the following rows coming through the buffer:

 

image

 

Notice a couple of things about our output. Where before we had a single source column, we now have both a StateProvinceName_Source and a StateProvinceName_Output column. If a DQS rule auto-corrected, say, “Folirad” to “Florida”, we would see the corrected value of Florida next to the invalid one (Folirad). Additionally, at the end there is a Record Status column indicating what kind of value each row contains. The available statuses are:

 

Correct: the value is right as it comes through and doesn’t need to be adjusted

Corrected: the data was wrong, but DQS was able to correct it to what we want

Suggested: the data was wrong, but it was *close* to a correct value, and DQS thinks that this value can be corrected.

Invalid: In DQS, this value was marked as invalid

New: DQS has never seen this value in the domain before, so it doesn’t know whether to accept or reject it
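
If you land the cleansed output in a staging table, profiling these statuses is a quick GROUP BY; a minimal sketch, where the landing table and column names are hypothetical:

SELECT [Record_Status]
      ,COUNT(*) AS [RecordCount]
FROM [dbo].[Staging_StateProvince_Cleansed]   -- hypothetical landing table
GROUP BY [Record_Status]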

 

A couple of items to note regarding the DQS component:

 

According to Microsoft, the best practice is to put a Conditional Split directly under this component and send each of the different record statuses down its own output. This is referenced in the blog article here.

The actual data quality work is done on the Data Quality server, NOT the SSIS server. This means that if you are running your package on Server A and the DQS server is installed on Server B, each row that comes through is sent to the Data Quality server on Server B to be processed. The DQS Cleansing component itself is really little more than a client running on the SSIS server. This can also affect performance greatly!

 

I’m sure there is a lot more that can be done in DQS, and I am really looking forward to doing some more digging around this component!

DQS: Creating the Knowledge Base

With the release of SQL Server 2012, one of the great features we gain is the ability to manage data quality much more easily, and without building something custom. We can do this with a tool called Data Quality Services, and using it is as easy as 1-2-3. In this blog post, I’m going to create a DQS knowledge base to handle the list of states and provinces that Adventure Works does business in. To demonstrate, let’s take a look at the simple spreadsheet I have here containing that list. Notice there is Florida in the list twice, spelled incorrectly and with the country abbreviated:

 

image

 

Pretty straightforward so far. What we want to do is use this list as the master list of all of the states and provinces we do business in. That way, if we get invalid records, rather than marking them as unknown or inferred, we can ensure they are sent to the proper individual (the data steward) to be handled appropriately.

 

Next, open up the Data Quality client, accessible by going to Start -> All Programs -> SQL Server 2012 -> Data Quality Services. Connect to your installed DQS instance, and the main page will appear:

image

 

Create a new knowledge base by clicking “New Knowledge Base” in the top left-hand corner. Give your knowledge base a name and a description, and select which activity it should use. There are three activities listed: Domain Management, Knowledge Discovery, and Matching Policy. Domain Management allows you to create a new domain from a blank canvas, Knowledge Discovery lets you build knowledge from an existing data set, and Matching Policy lets you either a) identify duplicate records, or b) identify related records.

I’m going to use the Domain Management activity, and I’m going to call the knowledge base Geography.

 

image

 

Click Next, and the Domain Management window appears. Domains in Data Quality Services are very similar to columns in a relational database, just with a defined list of what values are allowed. I remember it by thinking SELECT DISTINCT, where the list of rows returned is what your domain values should be.
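
To make that analogy concrete, here is a minimal sketch in T-SQL; the table and column names are hypothetical, purely to illustrate the idea:

-- The distinct values in a column are, conceptually, the valid values of a StateProvince domain
SELECT DISTINCT [StateProvince]
FROM [dbo].[CustomerAddress]   -- hypothetical source table
ORDER BY [StateProvince]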

For now, let’s create one.

Click the “Create a new domain” button and the Create Domain dialog box appears.

 

image

 

We’ll call ours StateProvince and leave the data type as String. Notice that DQS only offers four data types: String, Date, Integer, and Decimal. Additionally, under “Format Output to”, we can standardize the output of the domain to be all upper case, all lower case, or capitalized. For now, leave it as None and click OK. Once we’ve clicked OK, we get the ability to adjust the rules around the domain on the following domain management screen:

image

 

The first tab, Domain Properties, is identical to the dialog we just configured. This is useful if, after creating the domain, we decide we want to change the output format or adjust some of the other properties we’ve already set. The second tab is the Reference Data tab. If you have an Azure DataMarket account or want to use an online repository to manage the data set, you can configure it on this screen.

 

The third tab is the Domain Rules tab, where you set all of the rules that apply to the data set. Before we start creating rules, though, let’s import our spreadsheet of states so we can analyze it. The fourth tab, Domain Values, allows us to import a data set to evaluate as we create our rules, so we can check that any rule we create isn’t a) invalidating every row (or a huge portion of them), b) failing to affect any rows at all, or c) missing some other length or character issue. To add our sheet to the Domain Values tab, expand the Import button at the top right of the tab:

 

image

 

Select the workbook and the Sheet name, and select the “Use first row as header” check box:

 

image

 

Once the import is complete, the Import Domain Values dialog box appears with a summary of any synonyms the Data Quality client detected. After we click OK, each row is imported as a valid domain value:

 

image

 

 

 

Remember our issue with Florida being spelled incorrectly? Let’s set the two values as synonyms so that any rows coming in as Folrda are automatically corrected to Florida. Find the two values and select them both by holding the CTRL key and clicking them with your mouse, then right click and select “Set as Synonyms”:

 

image

 

Once you set them as synonyms, DQS maps the two together. Notice that Folrda has now been set to correct to Florida.
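
Conceptually, the synonym behaves like a correction lookup applied at cleansing time; expressed in T-SQL it would look roughly like this sketch, with a hypothetical staging table and column:

SELECT CASE WHEN [StateProvince] = 'Folrda' THEN 'Florida'   -- synonym corrected to its leading value
            ELSE [StateProvince]
       END AS [StateProvince_Corrected]
FROM [dbo].[StagingStateProvince]   -- hypothetical staging table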

 

Go back to the Domain Rules tab, and let’s create a domain rule. Click on the icon to add a new domain rule.

 

image

 

Under the Build a Rule section, we can build the rule from the different conditions that we want our data to adhere to:

image

 

Under Name, create a rule called Length. Under the Build a Rule section, select Greater than or equal to…, and then type in 5. Click Apply all rules and a warning will appear:

image
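
Before clicking through the warning, it’s worth spelling out what this rule actually does. It is roughly equivalent to the T-SQL check below, where the staging table and column names are hypothetical and only illustrate which values the rule would flag:

-- Values shorter than 5 characters would fail the Length rule
SELECT [StateProvince]
FROM [dbo].[StagingStateProvince]   -- hypothetical staging table
WHERE LEN([StateProvince]) < 5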

 

Click Yes, and then click OK. Then click Finish, and Publish. The knowledge base is now created and published. In a future blog post I’ll show how to use SSIS to consume the published knowledge base in an ETL design to help load your data warehouse.