In addition to actually creating and adding knowledge bases, one of the biggest benefits of Data Quality Services is actually getting to integrate published Data Quality projects into your enterprise data management solution. One of my favorite features in particular about DQS is the control that it gives the business over the logic. Once your business user creates their DQS knowledge base, however, the next step is to actually integrate. The DQS Cleansing component in SSIS allows us to do that integration into our ETL process, and it’s pretty simple to set up in SSIS.
To start, I’m going to start with a data source. For this example, I’m going to use the Person.vStateProvinceCountryRegion view in the Adventure Works transactional database, and join it to a published DQS domain around state and province information. My initial SSIS package is just simply a Data Flow task, with an OLE DB source pointing to the geography table:
Then we’ll drag on the DQS cleansing component in the SSIS toolbox, under “Other Transforms”. The DQS Cleansing Transformation Editor will appear:
Connect to the DQS instance of SQL server that you have installed, by clicking on the New button under Data Quality connection manager, and type in the server name in the pop up window.
Select the Data Quality Knowledge base that you want to connect to. Every DQS knowledge base that is published is accessible through this link.
We’ll select DQS data, and then the list of available domains will appear. Remember that DQS domains are column level, so we’d have a separate domain for each column:
Click on the mapping tab, and then check the box next to the State-Province name field. Under the dropdown in the Domain column, select US-State here, as we’re mapping the State Province name to the US State domain in DQS.
Click Ok on the component, and then create a dead end in your package by dragging on, for example, a union all or multicast. Add a data viewer and click Execute. The data viewer will appear and we’ll see the following rows coming through the buffer:
Notice a couple of things about our output. Whereas before we had one column for the source column, we now have a StateProvinceName_Source and a StateProvinceName_Output column. If we have DQS rules that auto-corrected, say Folirad to Florida, we would see the correct value of Florida next to the invalid one (Folirad). Additionally, at the end there is a record status column, indicating what kind of value the row falls into. The available options are:
Correct: The value in this column is right as it’s coming through, doesn’t need adjusted
Corrected: the data was wrong, but DQS was able to correct it to what we want
Suggested: the data was wrong, but it was *close* to a correct value, and DQS thinks that this value can be corrected.
Invalid: In DQS, this value was marked as invalid
New: the data in this row wasn’t in the DQS domain at all, and DQS doesn’t know whether to accept or reject, and has never seen this value in this domain before
A couple of items to note in regards to the DQS component:
According to Microsoft, the best practice approach is to have a conditional split directly under this component, to output each of the different record statuses down varying outputs. This is referenced via the following blog article here.
The actual Data Quality work is done on the Data Quality server, NOT the SSIS server. This means that if you are running your package on Server A, and the DQS server is installed on Server B, each row that comes through is sent to the Data Quality server on Server B for completion. The actual data quality component is really very little more than a client served on the SSIS server. This will also affect performance greatly!
I’m sure there is a lot more that can be done in DQS, and I am really looking forward to doing some more digging around this component!