Using fuzzy lookup for cleansing data using SSIS
Problem Statement : The ETL team has developed a sql code for cleansing the data from ERP such as Tally. The code entails multiple conditions for lookup of special characters. The ETL team intends to automate the lookup process having special characters. Hence it has developed an SSIS package for the same.
The original sql code has code for replacement of special characters as follows:
1st special character lookup where ‘amp;’ was replaced by ‘&’ and ‘#13;#10;’ was replaced by space
2nd special character lookup: in this the special characters within SQL such as char(9),char(11),char(12),char(13) have been replaced by blank.
The issue with the above code was that during each ETL extract across multiple companies, the ETL team had to look for various special characters because of the nature of the input data source. The ETL team desired to automate the entire process using SSIS ( fuzzy lookup function).
The fuzzy look up transform takes the input data from data flow and provide the connection with sql server for a specific database to retrieve the input table from the same database and match the column of the table as shown in below:
In the first step, the ETL team extracted the required columns from the source table.
After retrieving the required input columns fuzzy lookup transform uses the reference table to match the exact value between input column and look up column as shown below:
For exact matching set the value of threshold in fuzzy lookup transforms as shown in below:
The ETL team used an approximation of 65%(based on trial and error). It also set up the maximum number of matches per lookup output to be 1(no duplicate values). The special characters to be fed into the lookup transform have been attached in the fuzzy lookup ‘additional delimiters’ section.
For automating the following portion of the sql code entailing filter parameters:
The team used the conditional split as follows:
The conditional split is used in SSIS tool for condition such as IF and case statement as used in programming language and set the variables, parameters ,predefined functions and operators .
Union all is used to combine multiple sources and produce one output without sorted the data where the order of data is not necessary. If wants to sorted the data merge join is used instead of union all transformation.
Contact details:
For resource requirements:resources@mnnbi.com
For training and consulting services: apoorv@mnnbi.com
Blogspot details: https://mndatasolutionsindia.blogspot.com/
Written by Gauri(gauri@mnnbi.com)
https://mndatasolutionsindia.blogspot.com/p/gauri.html
For resource requirements:resources@mnnbi.com
For training and consulting services: apoorv@mnnbi.com
Blogspot details: https://mndatasolutionsindia.blogspot.com/
Written by Gauri(gauri@mnnbi.com)
https://mndatasolutionsindia.blogspot.com/p/gauri.html
No comments:
Post a Comment