Using fuzzy lookup
for cleansing data using SSIS
Problem Statement : The ETL team has developed a sql code for
cleansing the data from ERP such as Tally. The code entails multiple conditions
for lookup of special characters. The ETL team intends to automate the lookup
process having special characters. Hence it has developed an SSIS package for
the same.
The original sql code has code for replacement of special
characters as follows:
1st special character lookup where ‘amp;’ was
replaced by ‘&’ and ‘#13;#10;’ was replaced by space
2nd special character lookup: in this the special
characters within SQL such as char(9),char(11),char(12),char(13) have been
replaced by blank.
The issue with the above code was that during each ETL
extract across multiple companies, the ETL team had to look for various special
characters because of the nature of the input data source. The ETL team desired
to automate the entire process using SSIS
( fuzzy lookup function).
The fuzzy look up
transform takes the input data from data flow and provide the connection with
sql server for a specific database to retrieve the input table from the same database
and match the column of the table as
shown in below:
In the first step, the ETL team extracted the required
columns from the source table.
After retrieving the required input columns fuzzy lookup transform uses
the reference table to match the exact value between input column and look up
column as shown below:
For exact matching set the value of threshold in fuzzy
lookup transforms as shown in below:
The ETL team used an approximation of 65%(based on trial and
error). It also set up the maximum number of matches per lookup output to be
1(no duplicate values). The special characters to be fed into the lookup
transform have been attached in the fuzzy lookup ‘additional delimiters’
section.
For automating the following portion of the sql code
entailing filter parameters:
The team used the conditional split as follows:
The conditional split is used in SSIS tool for condition such
as IF and case statement as used in programming language and set the variables,
parameters ,predefined functions and
operators .
Union all is used to combine multiple sources and produce
one output without sorted the data where the order of data is not necessary. If
wants to sorted the data merge join is used instead of union all
transformation.
Contact details:
For resource requirements:resources@mnnbi.com
For training and consulting services: apoorv@mnnbi.com
Blogspot details: https://mndatasolutionsindia.blogspot.com/
Written by Gauri(gauri@mnnbi.com)
https://mndatasolutionsindia.blogspot.com/p/gauri.html
For resource requirements:resources@mnnbi.com
For training and consulting services: apoorv@mnnbi.com
Blogspot details: https://mndatasolutionsindia.blogspot.com/
Written by Gauri(gauri@mnnbi.com)
https://mndatasolutionsindia.blogspot.com/p/gauri.html