Thursday, 21 February 2019

Using MS Excel as an ETL and Data Cleansing tool-Tally and other ERPs


Business requirement: M&N BI has a product called Turbo data wherein data consolidation, data cleansing, data auditing and data profiling for reporting purposes was been done through SQL Server database. The limitations of the above solution were as follows:
·         Installation time of SQL server: SQL server would take a lot of time and memory to install on the end client machine.
·         Requirement of a database expert for the ETL process: the data from Tally and other ERPs can be very unclean. For cleansing the same, the ETL team required a database expert for the same.
·         Reporting flexibility: The ETL team looked to use MS Excel for increased reporting flexibility.
·         A quick solution for standalone machine: the product should work on standalone machines.

That is the ETL team desired to use Excel for reporting, data auditing and data profiling purposes to give the end clients the following benefits:
·         Reduced time of deployment
·         Reduced cost of the solution
·         Increased flexibility in reporting(specifically for GST purposes)
·         Reduce the cost of the manpower for Excel deployment.


Deployment
Delivery
Usage
Disposal
Faster
ü   
ü   


Easier
ü   
ü   
ü   

Convenient


ü   

More Fun


ü   

Environment Friendly







To develop the scalable solution, the following are the steps to be followed:
·         Extraction of the raw data from the ERP (Tally): The raw data needs to be cleansed, audited and profiled. Thereafter the reports of trial balance, balance sheet, profit and loss statements and duplicate analysis to be developed.
·         The historical version of the balance sheet extract to be stored at the client location (through the consolidation feature in Excel).
·         From the multiple spreadsheets that have been extracted over periods in time, the ETL team shall develop consolidated reports.

Extraction of the raw Tally data: for the same, Tally extractor shall be used (get the data onto flat files using pipe delimiter). The ETL team shall extract daybook, ledger master, item, master, godown master data from Tally ERP 9.0.



For cleaning of the data from tally database.

The ETL team imported the Daybook data. The given input spread sheets has various special characters across various columns. The ETL team was required to clean those special characters.
For example in the case of voucher type, Voucher number one has special characters like JV\9\MAYapos;16  
For removing the same the ETL team used the following excel function.

The cleansing function of the voucher shall incorporate cleansing of the following character sets:
·         apos;
·         amp;
·         quot;
·         |
·         #13;#10;
·         #10;
·         #13;

·         &



 The cleansing function of the voucher


=IF(ISNUMBER(SEARCH("apos;16",B2))=TRUE,REPLACE(B2,FIND("apos;16",B2),7,""),
IF(ISNUMBER(SEARCH("amp;",B2))=TRUE,REPLACE(B2,FIND("amp;",B2),7,""),IF(ISNUMBER(SEARCH("quot;",B2))=TRUE,REPLACE(B2,FIND("quot;",B2),7,""),IF(ISNUMBER(SEARCH("amount;",B2))=TRUE,REPLACE(B2,FIND("amount;",B2),7,""),B2))))



The methodology of cleansing is as follows:

·         Check whether a particular character is present in the string using Find function in excel.
·         Code the replace function in excel to replace the specialized character with blank.
·         Use nested if functions to check for the availability of the characters using ISNUMBER or ISTEXT or Isnontext.
·         The multiple conditions shall be incorporated in multiple IF statements.

This logic was applied for voucher number, voucher typename, companyname, ledgername, godownmname



For example in the case of Voucher typename, Voucher typename one has special characters like Journal-amp;1 
For removing the same the ETL team used the following excel function.

The cleansing function of the voucher shall incorporate cleansing of the following character sets:
·         apos;
·         amp;
·         quot;
·         #13;#10;
·         #10;
·         #13;
·         &
·         -amp;1

The cleansing function of the voucher type name

=IF(ISNUMBER(SEARCH("-amp;1",G2))=TRUE,REPLACE(G2,FIND("-amp;1",G2),8,""),IF(ISNUMBER(SEARCH("quot;",G2))=TRUE,REPLACE(A2,FIND("quot;",G2),8,""),IF(ISNUMBER(SEARCH("amount;",G2))=TRUE,REPLACE(G2,FIND("amount;",G2),8,""),G2)))


The methodology of cleansing is as follows:

·         Check whether a particular character is present in the string using Find function in excel.
·         Code the replace function in excel to replace the specialized character with blank.
·         Use nested if functions to check for the availability of the characters using ISNUMBER or ISTEXT or Isnontext.
·         The multiple conditions shall be incorporated in multiple IF statements.


In the case of Company Name, Company Name one has special characters like ZISON TECHNOLOGIES GHANA LTD Jan-2013
For removing the same the ETL team used the following excel function.

The cleansing function of the voucher shall incorporate cleansing of the following character sets:
·         - 2013
·         apos;
·         amp;
·         quot;
·         Jan-2013
·         -amp;1
The cleansing function of the Company name

=IF(ISNUMBER(SEARCH("Jan-2013",A2))=TRUE,REPLACE(A2,FIND("Jan-2013",A2),29,""),IF(ISNUMBER(SEARCH("quot;",A2))=TRUE,REPLACE(A2,FIND("quot;",A2),29,""),IF(ISNUMBER(SEARCH("amount;",A2))=TRUE,REPLACE(A2,FIND("amount;",A2),29,""),A2)))
The methodology of cleansing is as follows:


·         Check whether a particular character is present in the string using Find function in excel.
·         Code the replace function in excel to replace the specialized character with blank.
·         Use nested if functions to check for the availability of the characters using ISNUMBER or ISTEXT or Isnontext.
The multiple conditions shall be incorporated in multiple IF statements.

The cleansing function of the Party Ledger Name shall incorporate cleansing of the following character sets:

·         apos;
·         amp;
·         quot;
·         Jan-2013
·         &
·         -amp;1


The cleansing function of the Party Ledger Name

=IF(ISNUMBER(SEARCH("apos;16",J5))=TRUE,REPLACE(J5,FIND("apos;16",J5),7,""),
IF(ISNUMBER(SEARCH("amp;",J5))=TRUE,REPLACE(J5,FIND("amp;",J5),7,""),IF(ISNUMBER(SEARCH("quot;",J5))=TRUE,REPLACE(J5,FIND("quot;",J5),7,""),IF(ISNUMBER(SEARCH("amount;",J5))=TRUE,REPLACE(J5,FIND("amount;",J5),7,""),J5))))


The cleansing function of the amount shall incorporate cleansing of the following character sets:
·         |
·         $@ GH? 3.88/  =
·         -3202.18 $ @ GH? 3.88/ $
·         = -GH?
·         2613.00 $ @ GH? 4/ $ = GH? 10452.00|

The cleansing function of the amount

=IF(ISNUMBER(SEARCH("=", Q3))=TRUE,
REPLACE(IF(ISNUMBER(SEARCH("|",Q3))=TRUE,RIGHT(Q3,LEN(Q3)-(SEARCH("=",Q3)+5))),FIND("|",IF(ISNUMBER(SEARCH("|",Q3))=TRUE,RIGHT(Q3,LEN(Q3)-(SEARCH("=",Q3)+5)))),7,""),IF(ISNUMBER(SEARCH("|", Q3))=TRUE,REPLACE(Q3,FIND("|", Q3),1,""),Q3))
The methodology of cleansing is as follows:

·         Check whether a particular character is present in the string using Find function in excel.
·         Code the replace function in excel to replace the specialized character with blank.
·         Used search function in excel for extracting the substring.
·         Use nested if functions to check for the availability of the characters using ISNUMBER or ISTEXT or Isnontext.
·         The multiple conditions shall be incorporated in multiple IF statements.




Next steps:
Please send the Excel spreadsheets for data cleansing
Please send the Tally data for data consolidation, data cleansing, data auditing and data profiling.


Blog prepared by :

Agrima Lohia: https://www.linkedin.com/in/agrima-lohia-77799692/

Please contact the following for additional information:
Name: Apoorv Chaturvedi
Email: apoorv@mnnbi.com
Phone: +91-8802466356


Wednesday, 30 January 2019

Optimizing the query execution times by converting the hash joins to merge joins


The approach of the etl team is to optimize the sql given by the end clients without doing any of the following:
  •          Separating the transaction and the reporting layers
  •         Developing another transaction layer.


In this module we shall analyze the joins that where used as part of the execution process. Thereafter we shall optimize the joins by converting hash joins(no index usage) to nested loop joins(index usage from one table) and thereafter to merge joins(optimize index usage from both the tables).
For the process of join optimization, the etl resource chose the stored procedure sp_item_purchase_order from the datawarehouse load of turbodata as given under.
DECLARE @SUPPLIER INT;
SET @SUPPLIER=(SELECT DIM_SUPPLIER_ID FROM DIM_SUPPLIER WHERE SUPPLIER_NAME LIKE 'N/A');

UPDATE  [STATUS] SET [DISPLAYTEXT]='ITEM_PURCHASE_ORDER LOAD DONE',[COMPLETED]=66;

WITH CTE1 AS(
                 
                  SELECT A.COMPANY_NAME,A.ORDER_NO,A.PARTY_LEDGER_NAME,A.ENTERED_BY,ORDER_DATE,
                  A.AMOUNT,A.AMOUNT_SUBSTRING,A.AMOUNT_SUBSTRING2,A.AMOUNT_SUBSTRING_AFTER$,A.CATEGORY,A.COST_CENTER,
                  A.VOUCHER_TYPE_NAME,A.XYZ,
                  ISNULL(D.DIM_DATE_ID,1) AS DIM_DATE_ID,
                  LEDGERNAME
                  FROM STG_DAYBOOK A
                  LEFT OUTER JOIN DIM_DATE D
                  ON A.ORDER_DATE=D.DATE_DESC        
                        WHERE (A.VOUCHER_TYPE_NAME IN   
                                    (SELECT DISTINCT VOUCHER_TYPE_NAME
                                                FROM STG_VOUCHER_TYPE
                                                WHERE VOUCHER_TYPE IN ('PURCHASE')))
                        --AND A.IS_DEEMED_POSITIVE LIKE '%YES%'
                  )
                  ,
CTE2 AS (
                  SELECT SUM (CAST(A.AMOUNT_SUBSTRING2 AS FLOAT)) AS AMOUNT_SUBSTRING2
                  ,SUM(CAST(A.AMOUNT_SUBSTRING_AFTER$ AS FLOAT)) AS AMOUNT_SUBSTRING_AFTER$
                  ,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, ISNULL(A.PARTY_LEDGER_NAME,'N/A') AS PARTY_LEDGER_NAME ,
                  ISNULL(A.LEDGERNAME,'N/A') AS LEDGERNAME,
                  K.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ
                  FROM CTE1 A
                  LEFT OUTER JOIN ITEM_SUPPLIER_ORDER K
                  ON  A.ORDER_NO = K.ORDER_NO
                  AND A.VOUCHER_TYPE_NAME=K.VOUCHER_TYPE_NAME
                  AND A.DIM_DATE_ID=K.DIM_DATE_ID
                  AND A.COMPANY_NAME=K.COMPANY_NAME
                  AND A.PARTY_LEDGER_NAME=K.PARTY_LEDGER_NAME
AND A.LEDGERNAME=K.LEDGERNAME
AND A.COST_CENTER=K.COST_CENTER
                  GROUP BY
                  --A.DIM_ITEM_ID,
                   A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, K.ITEM_SUPPLIER_ORDER_ID,
                  ISNULL(A.PARTY_LEDGER_NAME,'N/A')
                  ,A.COMPANY_NAME,A.DIM_DATE_ID,A.CATEGORY,
A.COST_CENTER,A.XYZ,
ISNULL(A.LEDGERNAME,'N/A')
            )
           
           
           
           
           
            ,
CTE3 AS (
                  SELECT COUNT (DISTINCT(ORDER_NO))AS COUNT_ORDER_ID,
                  ITEM_SUPPLIER_ORDER_ID
                  FROM CTE2
                  GROUP BY
                  ITEM_SUPPLIER_ORDER_ID
            ),
CTE4 AS (
                        SELECT A.AMOUNT_SUBSTRING2,
A.AMOUNT_SUBSTRING_AFTER$
,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                  FROM CTE2 A
                  JOIN CTE3 B ON
                   A.ITEM_SUPPLIER_ORDER_ID=B.ITEM_SUPPLIER_ORDER_ID
                  WHERE B.COUNT_ORDER_ID=1
            ),
CTE5 AS (
                        SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  --A.DIM_ITEM_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                  FROM CTE2 A
                  JOIN CTE3 B ON
                   A.ITEM_SUPPLIER_ORDER_ID=B.ITEM_SUPPLIER_ORDER_ID
                  WHERE B.COUNT_ORDER_ID>1
                  AND A.ENTERED_BY LIKE '%ADMIN%'
            ),
CTE6 AS (
                 
                  SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                   FROM CTE4 A
            UNION ALL
                  SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                   FROM CTE5 A
            )
     
MERGE ITEM_PURCHASE_ORDER AS TARGET
USING CTE6 AS SOURCE
                  ON(TARGET.ITEM_SUPPLIER_ORDER_ID = SOURCE.ITEM_SUPPLIER_ORDER_ID
                  AND TARGET.DIM_DATE_ID= SOURCE.DIM_DATE_ID
                  AND TARGET.COMPANY_NAME=SOURCE.COMPANY_NAME
                  --AND TARGET.PARTY_LEDGER_NAME=SOURCE.PARTY_LEDGER_NAME
                  --AND TARGET.LEDGERNAME=SOURCE.LEDGERNAME
                  )
                 
WHEN NOT MATCHED BY TARGET
                  THEN INSERT(AMOUNT_SUBSTRING2,AMOUNT_SUBSTRING_AFTER$,
                  --ORDER_DATE,
                  ENTERED_BY,
                  --ORDER_NO,VOUCHER_TYPE_NAME,
                  --PARTY_LEDGER_NAME,
                  ITEM_SUPPLIER_ORDER_ID,
                  DIM_DATE_ID,
                  COMPANY_NAME
                  --,CATEGORY,COST_CENTER,XYZ,LEDGERNAME
                  )
                  VALUES(SOURCE.AMOUNT_SUBSTRING2,SOURCE.AMOUNT_SUBSTRING_AFTER$,
                  --SOURCE.ORDER_DATE     ,
                  SOURCE.ENTERED_BY ,
                  --SOURCE.ORDER_NO ,
                  --SOURCE.   VOUCHER_TYPE_NAME ,
                  --SOURCE.   PARTY_LEDGER_NAME ,
                  SOURCE.     ITEM_SUPPLIER_ORDER_ID  ,
                  SOURCE.     DIM_DATE_ID ,
                  SOURCE.COMPANY_NAME
                  --SOURCE.CATEGORY,
                  --SOURCE.COST_CENTER,
                  --SOURCE.XYZ,
                  --SOURCE.LEDGERNAME
                  )
WHEN MATCHED
                  THEN UPDATE SET TARGET.AMOUNT_SUBSTRING2= SOURCE.AMOUNT_SUBSTRING2
                  ,TARGET.AMOUNT_SUBSTRING_AFTER$=SOURCE.AMOUNT_SUBSTRING_AFTER$;

Thereafter the ETL resource ran the uery execution panel to check for the joins and the index usage. The resource identified the usage of hash joins wherever possible.


The ETL resource looked for the  following:

·         Hash match cost
·         Table scan cost.
Within the hash match cost, the etl resource looked for the following:
·         The join creating the hash match.
·         The table size
·         The total costs including the subtree cost. 



The same operation shall be repeated for the table scans:

Hereafter the ETL resource identified the join to be worked upon from the given sql.


Steps to convert the hash join to nested loop join:
  • ·         Take the cte output and store in a temporary table in the same order as the joins.
  • ·         Create index in the same order as joins.




Developing the temporary table for the second table alo9ng with the indexes in the same order in which the joins have been done.





Hereafter the ETL developer has 2 options:
  •          Use the join conditions
  •          Change the join conditions to where clause




Understanding the impact of where clause.




Using the join conditions instead of the where clause: in our case instead of using the join conditions instead of where clause the optimization process remains the same.





Summary of the findings:
·         In converting the hash joins to nested loops, the maximum memory is used by the following:
o   Sort operation
o   Index seek: indexed nested loop scan
o   Table scan
The cost reduction comes in nested loop join been used instead of the hash join



For the next steps regarding the join optimization, please contact the following:
Name: Apoorv Chaturvedi
Email: apoorv@mnnbi.com
Phone: +91-8802466356
Website: www.mnnbi.com
Prepared by Ankur Gupta(Intern): https://www.linkedin.com/in/ankur-gupta-3b9a71179

Initial and Incremental data Load Template by M&N Business Intelligence-SAP Data Services

  INCREMENTAL LOAD/CREATION OF DIMENSION TABLE LOGIC AT SAP DATA SERVICES END CLIENT In this particular document we shall be looking at the ...