Recently I came across a post online offering a solution to duplicates in a data flow. As I was reading it, something made me cringe: a Sort transform. Those of you who know me know that the Sort transform in a data flow gives me nightmares, almost as much as the Slowly Changing Dimension wizard on medium and large dimensions. I kept thinking there has to be a better way than sorting the data inside the data flow with that component. You could write a script to do it, but my personal opinion is to do as much in the data and control flow as possible without scripting, as I feel script components make SSIS much harder to maintain. I came up with an alternate solution, and I would like to use the rest of this blog post to present my findings.
For my sample tests here, I am going to use a sample customer file that I've created:
I’m also going to make an assumption that we don’t have DQS available in this environment. If you want more information on how DQS can help with this problem, see my earlier posts here.
The first option we have is to use the Fuzzy Grouping component. Let's start by creating a simple SSIS package to pick up the customer csv file. I've created a package, the csv source connection manager, and a flat file source in my data flow. We can tell by looking at the file that Bob and Robert are 99% duplicates of each other, based on their matching telephone numbers. SSN could work just as well here, or any other column, as long as we know it reliably tells us who a customer is and what makes them distinct (or product, or account, or facility, etc.). Moving on! Let's drag over a Fuzzy Grouping component from the "Other Transforms" section of the SSIS toolbox, connect the data flow path to it, and open it up.
Configure the OLE DB connection manager, and then go to the Columns tab. We want to tell this component the same key that we identified in the last step: what can the component use to group the duplicates together and come up with the distinct (in this case) customers? In our sample set, that's the telephone number. Check the box next to telephone number, and a table will appear. Under the Minimum Similarity column, input a value that tells the component how similar two values must be to count as a match. The right value depends on your environment and your specific data point. For phone numbers, we're going to say that a 99% match probably means the same customer or household. Groupings of products, for example, may justify lowering this number to something in the 75%-80% range. My completed column looks like the below:
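SSIS doesn't expose Fuzzy Grouping's internals, but conceptually the component behaves something like the sketch below: each row gets a key_in id, and key_out points at the first earlier row whose key column clears the minimum similarity threshold. The function name and the use of difflib's ratio are my own stand-ins for illustration, not the component's actual matching algorithm.

```python
from difflib import SequenceMatcher


def fuzzy_group(rows, key, min_similarity=0.99):
    """Tag each row with key_in (its own id) and key_out (the id of the
    first sufficiently similar row seen, i.e. the group's canonical row)."""
    leaders = []   # (key_in, key value) of canonical rows seen so far
    grouped = []
    for key_in, row in enumerate(rows, start=1):
        key_out = key_in  # assume it leads its own group until a match is found
        for leader_id, leader_value in leaders:
            if SequenceMatcher(None, row[key], leader_value).ratio() >= min_similarity:
                key_out = leader_id
                break
        if key_out == key_in:
            leaders.append((key_in, row[key]))
        grouped.append({**row, "key_in": key_in, "key_out": key_out})
    return grouped
```

With a 0.99 threshold on the phone column, Bob and Robert would land in the same group, while a customer with a different number keeps a group of their own. The useful property, which we'll lean on shortly, is that a row where key_in equals key_out is the group's canonical record.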
Let's look at what we have so far. Drag over a Multicast, and enable the data viewer on the path into it so we can see the rows pass through the pipeline. Execute the package, and our data viewer appears:
I included the data flow to the left in this screenshot, so you can also see what the package looks like. We can see the key_in and key_out fields have been added. Let's take a Conditional Split component, and configure it with 2 outputs: one for the canonical records, where key_in equals key_out, and one for the other, duplicate records:
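The split logic itself is trivial: a row where key_in equals key_out is the record we keep, and everything else is a duplicate pointing back at its canonical row. A minimal sketch of that same rule (the function name is mine, not part of SSIS):

```python
def conditional_split(rows):
    """Mimic the conditional split: rows where key_in == key_out are the
    canonical records; everything else is a duplicate to be audited."""
    clean = [r for r in rows if r["key_in"] == r["key_out"]]
    dupes = [r for r in rows if r["key_in"] != r["key_out"]]
    return clean, dupes
```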
For the duplicate records, we don't just want to delete them out of the data flow. We want to keep track of them somewhere, so we can see what's being removed from the data flow. Let's drag over a destination component for each of our good and bad record sets, and configure them. Our finished data flow looks like the below:
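As a stand-in for the two destination components, here's a sketch that lands each set of rows in its own csv file; in the package these would be your clean table and your audit table. The helper name and file layout are mine, for illustration only.

```python
import csv


def write_outputs(clean, dupes, clean_path, audit_path):
    """Route the clean rows and the removed duplicates to separate
    destinations, so nothing leaves the pipeline without a trace."""
    for path, rows in ((clean_path, clean), (audit_path, dupes)):
        with open(path, "w", newline="") as f:
            if rows:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)
```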
Run the package, and let’s check our tables:
We can see that the bad records moved into the appropriate table for auditing, and our new clean records are now available for us to use!