SSIS: Using the DQS Cleansing component

In addition to creating and managing knowledge bases, one of the biggest benefits of Data Quality Services (DQS) is integrating published Data Quality projects into your enterprise data management solution. One of my favorite features of DQS is the control it gives the business over the cleansing logic. Once a business user has created and published a DQS knowledge base, the next step is integration. The DQS Cleansing component in SSIS lets us bring that knowledge base into our ETL process, and it's pretty simple to set up.


To start, I need a data source. For this example, I'm going to use the Person.vStateProvinceCountryRegion view in the Adventure Works transactional database and join it to a published DQS domain containing state and province information. My initial SSIS package is simply a Data Flow task, with an OLE DB source pointing to that view:




Then we'll drag on the DQS Cleansing component from the SSIS toolbox, under "Other Transforms". The DQS Cleansing Transformation Editor will appear:




Connect to the DQS instance of SQL Server that you have installed by clicking the New button under Data quality connection manager, and type the server name in the pop-up window.



Select the Data Quality knowledge base that you want to connect to. Every published DQS knowledge base is accessible through this dropdown.




We'll select DQS Data, and the list of available domains will appear. Remember that DQS domains are column-level, so we'd have a separate domain for each column we want to cleanse:




Click on the Mapping tab, and then check the box next to the StateProvinceName field. In the dropdown in the Domain column, select US-State, since we're mapping the state/province name to the US State domain in DQS.




Click OK on the component, and then create a dead end in your package by dragging on, for example, a Union All or Multicast. Add a data viewer and click Execute. The data viewer will appear, and we'll see the following rows coming through the buffer:




Notice a couple of things about our output. Where before we had a single source column, we now have a StateProvinceName_Source and a StateProvinceName_Output column. If we had a DQS rule that auto-corrected, say, Folirad to Florida, we would see the corrected value (Florida) next to the invalid one (Folirad). Additionally, at the end there is a Record Status column indicating which category each row falls into. The available statuses are:


Correct: the value is right as it comes through and doesn't need to be adjusted.

Corrected: the data was wrong, but DQS was able to correct it to the value we want.

Suggested: the data was wrong, but it was *close* to a correct value, and DQS suggests a correction.

Invalid: this value was explicitly marked as invalid in the DQS domain.

New: the value in this row wasn't in the DQS domain at all; DQS has never seen it in this domain before and doesn't know whether to accept or reject it.
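To make the statuses concrete, here is a minimal Python sketch (not the actual SSIS component) of how downstream logic might route rows by Record Status, the way a Conditional Split would. The row values and column names are hypothetical examples:

```python
# Illustrative sketch: group cleansed rows by their DQS Record Status,
# mimicking what a Conditional Split would do downstream of the component.
from collections import defaultdict

# Hypothetical rows as they might appear in the data flow buffer.
rows = [
    {"StateProvinceName_Source": "Florida", "StateProvinceName_Output": "Florida", "Record Status": "Correct"},
    {"StateProvinceName_Source": "Folirad", "StateProvinceName_Output": "Florida", "Record Status": "Corrected"},
    {"StateProvinceName_Source": "Floria",  "StateProvinceName_Output": "Florida", "Record Status": "Suggested"},
    {"StateProvinceName_Source": "XX",      "StateProvinceName_Output": "XX",      "Record Status": "Invalid"},
    {"StateProvinceName_Source": "Yukon",   "StateProvinceName_Output": "Yukon",   "Record Status": "New"},
]

def split_by_status(rows):
    """Route each row to an output bucket keyed by its Record Status."""
    outputs = defaultdict(list)
    for row in rows:
        outputs[row["Record Status"]].append(row)
    return outputs

outputs = split_by_status(rows)
for status in ("Correct", "Corrected", "Suggested", "Invalid", "New"):
    print(status, len(outputs[status]))
```

In a real package, each bucket would be a separate Conditional Split output: Corrected and Correct rows flow to the destination, while Suggested, Invalid, and New rows go to review or error tables.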


A couple of items to note regarding the DQS Cleansing component:


According to Microsoft, the best-practice approach is to place a Conditional Split directly under this component, routing each of the different record statuses down its own output. This approach is described in the blog article referenced here.

The actual data quality work is done on the Data Quality server, NOT the SSIS server. This means that if your package runs on Server A and DQS is installed on Server B, each row that comes through the data flow is sent to the Data Quality server on Server B for cleansing. The DQS Cleansing component itself is really little more than a client running on the SSIS server. This can also greatly affect performance!


I'm sure there is a lot more that can be done with DQS, and I'm really looking forward to digging further into this component!


4 thoughts on “SSIS: Using the DQS Cleansing component”

  1. Thanks for the article. I see that when a column’s value is deemed “New”, the record status shows as “Correct”. Other than by brute force of a gigantic OR clause (because I have many columns to cleanse), how can I make the record have a status of “New” to signify that some column’s status is “New”? Thanks.

    1. Hi Adolfo, sorry for the delay on this; I wanted to think about your problem. Since you have many columns to cleanse, the challenge is that by default you'd have to create that OR clause. What I'd recommend is creating a staging table in a database that you could use to consolidate all of the columns into one "DQS Status" column. You could then use that as a source in a new data flow and route it to different conditional splits. You'd also need a column-name column and a value column. Kind of a pain, but a workaround. This is all theory in my head, but I think it would work.

      1. Thanks for that idea. I was thinking too about a staging table and going with a metadata solution by which I will search for the text “New” in all columns whose name ends in “_Status”.

  2. Hi, instead of the staging table I wrote a script component that looks at columns whose names end with “_Status”. If any has the value “New”, it updates the “Record Status” column to “New”. My conditional split can then handle the rest as I wanted.
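The script-component logic described in that last comment can be sketched as follows. This is an illustrative Python version (the real SSIS script component would be written in C# or VB.NET, and the column names here are hypothetical):

```python
# Illustrative sketch: if any per-column status ("*_Status") is "New",
# promote the row-level "Record Status" to "New" so a single Conditional
# Split condition can route the row.
def promote_new_status(row):
    """Mutate and return the row, raising its Record Status if needed."""
    if any(value == "New" for name, value in row.items() if name.endswith("_Status")):
        row["Record Status"] = "New"
    return row

row = {
    "City_Status": "Correct",
    "StateProvinceName_Status": "New",
    "Record Status": "Correct",
}
print(promote_new_status(row)["Record Status"])  # → New
```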
