As IT professionals, we always take a dev->test->prod approach during development (or at least dev->prod). With the release of Microsoft’s Data Quality Services in SQL Server 2012, it is natural to assume that creating and publishing rules and domains for our data quality knowledge bases would follow the same path. As Business Intelligence developers, we then consume the knowledge base with an SSIS package during transformation and deploy. Unfortunately, there is a known issue with Data Quality Services where this doesn’t always work. Less than a week after I solved the issue for my client, I received an email from one of our Microsoft partners asking about the same error. I thought I would detail the error here in case anyone else encounters it as well. The error only appears on the deployment (target) server, and gives you this error message:
Data Flow Task Name:Error: Microsoft.Ssdqs.Infra.Exceptions.EntryPointException: The Knowledge Base does not exist [Id : 1000999].
at Microsoft.Ssdqs.Proxy.EntryPoint.KnowledgebaseManagementEntryPointClient.DQProjectGetById(Int64 id)
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPostExecute(IDTSManagedComponentWrapper100 wrapper)
Data Flow Task Name:Error: System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessChunk(ReadOnlyCollection`1 fieldMappings, ReadOnlyCollection`1 records, CorrectedRecordsStatusStatistics& correctedRecordsTotalStatusStatistics)
at Microsoft.Ssdqs.Component.DataCorrection.Logic.DataCorrectionComponent.ProcessInput(Int32 inputID, PipelineBuffer buffer)
at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostProcessInput(IDTSManagedComponentWrapper100 wrapper, Int32 inputID, IDTSBuffer100 pDTSBuffer, IntPtr bufferWirePacket)
The line we’re concerned with in this case is the top line: The knowledge base does not exist. The error means that SSIS cannot find the corresponding knowledge base in the DQS_Main database, which we can verify by looking at the dbo.A_KNOWLEDGEBASE table:
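To see this for yourself, you can list the rows for your knowledge base directly (a sketch; substitute your own knowledge base name for the placeholder):

```sql
-- Each publish inserts a new row, so the same knowledge base name
-- appears multiple times with increasing [ID] values.
SELECT [ID], [NAME]
FROM [DQS_MAIN].[dbo].[A_KNOWLEDGEBASE]
WHERE [NAME] LIKE '%PutYourKnowledgebaseNameHere%'
ORDER BY [ID] DESC;
```

The row with the highest [ID] is the most recently published version of that knowledge base.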
The key concept here is that every time you publish a knowledge base, a new ID is created. Going back to our opening statement, we would therefore typically think of the cycle as looking like this:
Every time we do a deployment, in addition to deploying the SSIS packages, we need to deploy the DQS knowledge base as well. However, knowing now that a new ID is created every time we cross the “deployment line”, SSIS will never be able to stay in sync with the DQS knowledge base ID number: dev will naturally have many more releases than production, knocking the knowledge base IDs out of sync across servers. There seem to be several ways to work around this issue, and some (to me) seem better than others:
Option 1: Manually open the package on the destination server within SQL Server Data Tools to force the knowledgebase_id to refresh.
NOTE: Since I don’t know many shops that make a habit of opening things in production just to refresh them, I’m going to disqualify this one and move on.
Option 2: View the XML of the package and manually change the knowledgebase_id so the package points to the right knowledge base.
Solution: A slightly better alternative than opening the package, at least. We can do this by connecting to our DQS_MAIN database and querying the A_KNOWLEDGEBASE table that we looked at earlier:

SELECT MAX([ID]) AS [KB_ID]
FROM [dbo].[A_KNOWLEDGEBASE]
WHERE [NAME] LIKE '%PutYourKnowledgebaseNameHere%'
This will return the maximum ID for the knowledge base you’re looking for. We can then change the value inline in the SSIS package XML. Locate the SSIS package that calls the knowledge base in the file system, right-click it, and open it with your favorite XML editor. Once it’s open, we’re looking for the line indicating the knowledge base, whose ID value we want to replace with the number returned by our query.
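What that line looks like varies by SSIS version, so treat the fragment below as an illustration only; the property name and surrounding elements are assumptions, not the exact .dtsx schema. Searching the file for “Knowledgebase” will take you to the real line:

```xml
<!-- Illustrative .dtsx fragment; the property name here is an assumption.
     Search the file for "Knowledgebase" to find the actual element,
     then replace its value with the ID returned by the query above. -->
<properties>
  <property name="KnowledgebaseId" dataType="System.Int64">1000999</property>
</properties>
```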
Again, though, we have the same issue as in Option 1 where we need to manually edit things in production.
Option 3: This solution makes the most sense to me, but it’s a definite shift from the way we’re generally used to doing things in IT (and development in general). Remember that, at its core, DQS is designed to be a product used by an end user. It is, in actuality, a tool that a business analyst can (and should) use to create rules around their data, which we as developers then consume. Development and production, in this case, almost seem as though they should be a single step, allowing the analyst to preview and analyze before the initial publish. If there is one knowledge base that everything points to, maintained by the analyst, we no longer have to worry about deploying DQS databases across environments. Additionally, DQS is such a new product that just teaching someone the basics of what they can do with it is enough; teaching a business analyst the ins and outs of deployment processes seems unnecessary. Visually, this approach looks like the below.
Having one knowledge base maintained by the user allows all ETL processes, whether dev, test, or production, to read from the same knowledge base. This prevents the error from occurring: no matter how many times our users update the knowledge base, all SSIS packages will always pull the most recent knowledge base ID, because they are all querying the same A_KNOWLEDGEBASE table in the DQS_MAIN database.