Wednesday, 5 June 2024

Fuzzy Lookup for replacing special characters in SSIS

 Using fuzzy lookup for cleansing data using SSIS

Problem Statement : The ETL team has developed a sql code for cleansing the data from ERP such as Tally. The code entails multiple conditions for lookup of special characters. The ETL team intends to automate the lookup process having special characters. Hence it has developed an SSIS package for the same.

The original sql code has code for replacement of special characters as follows:

 



1st special character lookup where ‘amp;’ was replaced by   ‘&’ and ‘#13;#10;’ was replaced by space  



2nd special character lookup: in this the special characters within SQL such as char(9),char(11),char(12),char(13) have been replaced by blank.

The issue with the above code was that during each ETL extract across multiple companies, the ETL team had to look for various special characters because of the nature of the input data source. The ETL team desired to automate the entire process using SSIS  ( fuzzy lookup function).


 The replacement of special characters using fuzzy look up transformation which performs data cleansing task such as data correction and providing missing values. Fuzzy look up transforms return data with one or more closest matching values. It does not differentiate between close values and provides exact matching. Fuzzy look up transformation is used to correct and cleanse the data. The fuzzy look up transform takes input data from the data flow and matches with a table for the mismatch values.



The fuzzy look up transform takes the input data from data flow and provide the connection with sql server   for a  specific database to  retrieve the input table from the same database and match the column of the table  as shown in below:

In the first step, the ETL team extracted the required columns from the source table.




After retrieving the  required input columns fuzzy lookup transform uses the reference table to match the exact value between input column and look up column as shown below:



For exact matching set the value of threshold in fuzzy lookup transforms as shown in below:
The ETL team used an approximation of 65%(based on trial and error). It also set up the maximum number of matches per lookup output to be 1(no duplicate values). The special characters to be fed into the lookup transform have been attached in the fuzzy lookup ‘additional delimiters’ section.




For automating the following portion of the sql code entailing filter parameters:

The team used the conditional split as follows:

The conditional split is used in SSIS tool for condition such as IF and case statement as used in programming language and set the variables, parameters  ,predefined functions and operators .



Union all is used to combine multiple sources and produce one output without sorted the data where the order of data is not necessary. If wants to sorted the data merge join is used instead of union all transformation.

Contact details:
For resource requirements:resources@mnnbi.com
For training and consulting services: apoorv@mnnbi.com
Blogspot details: https://mndatasolutionsindia.blogspot.com/

Written by Gauri(gauri@mnnbi.com)

https://mndatasolutionsindia.blogspot.com/p/gauri.html

No comments:

Post a Comment

Initial and Incremental data Load Template by M&N Business Intelligence-SAP Data Services

  INCREMENTAL LOAD/CREATION OF DIMENSION TABLE LOGIC AT SAP DATA SERVICES END CLIENT In this particular document we shall be looking at the ...