Wednesday, 30 January 2019

Optimizing query execution times by converting hash joins to merge joins


The approach of the ETL team is to optimize the SQL given by the end clients without doing any of the following:
  •          Separating the transaction and the reporting layers
  •          Developing another transaction layer.


In this module we shall analyze the joins that were used as part of the execution process. Thereafter we shall optimize the joins by converting hash joins (no index usage) to nested loop joins (index usage from one table) and then to merge joins (optimized index usage from both tables).
For the join optimization exercise, the ETL resource chose the stored procedure sp_item_purchase_order from the Turbodata data warehouse load, as given below.
DECLARE @SUPPLIER INT;
SET @SUPPLIER=(SELECT DIM_SUPPLIER_ID FROM DIM_SUPPLIER WHERE SUPPLIER_NAME LIKE 'N/A');

UPDATE  [STATUS] SET [DISPLAYTEXT]='ITEM_PURCHASE_ORDER LOAD DONE',[COMPLETED]=66;
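-- CTE1: purchase vouchers from STG_DAYBOOK, with the DIM_DATE surrogate key resolved (defaulting to 1 when the order date is not found).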

WITH CTE1 AS(
                 
                  SELECT A.COMPANY_NAME,A.ORDER_NO,A.PARTY_LEDGER_NAME,A.ENTERED_BY,ORDER_DATE,
                  A.AMOUNT,A.AMOUNT_SUBSTRING,A.AMOUNT_SUBSTRING2,A.AMOUNT_SUBSTRING_AFTER$,A.CATEGORY,A.COST_CENTER,
                  A.VOUCHER_TYPE_NAME,A.XYZ,
                  ISNULL(D.DIM_DATE_ID,1) AS DIM_DATE_ID,
                  LEDGERNAME
                  FROM STG_DAYBOOK A
                  LEFT OUTER JOIN DIM_DATE D
                  ON A.ORDER_DATE=D.DATE_DESC        
                        WHERE (A.VOUCHER_TYPE_NAME IN   
                                    (SELECT DISTINCT VOUCHER_TYPE_NAME
                                                FROM STG_VOUCHER_TYPE
                                                WHERE VOUCHER_TYPE IN ('PURCHASE')))
                        --AND A.IS_DEEMED_POSITIVE LIKE '%YES%'
                  )
                  ,
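-- CTE2: aggregates the amount substrings and looks up ITEM_SUPPLIER_ORDER_ID from ITEM_SUPPLIER_ORDER (left outer join on the order, voucher, date, company, ledger and cost centre keys).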
CTE2 AS (
                  SELECT SUM (CAST(A.AMOUNT_SUBSTRING2 AS FLOAT)) AS AMOUNT_SUBSTRING2
                  ,SUM(CAST(A.AMOUNT_SUBSTRING_AFTER$ AS FLOAT)) AS AMOUNT_SUBSTRING_AFTER$
                  ,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, ISNULL(A.PARTY_LEDGER_NAME,'N/A') AS PARTY_LEDGER_NAME ,
                  ISNULL(A.LEDGERNAME,'N/A') AS LEDGERNAME,
                  K.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ
                  FROM CTE1 A
                  LEFT OUTER JOIN ITEM_SUPPLIER_ORDER K
                  ON  A.ORDER_NO = K.ORDER_NO
                  AND A.VOUCHER_TYPE_NAME=K.VOUCHER_TYPE_NAME
                  AND A.DIM_DATE_ID=K.DIM_DATE_ID
                  AND A.COMPANY_NAME=K.COMPANY_NAME
                  AND A.PARTY_LEDGER_NAME=K.PARTY_LEDGER_NAME
AND A.LEDGERNAME=K.LEDGERNAME
AND A.COST_CENTER=K.COST_CENTER
                  GROUP BY
                  --A.DIM_ITEM_ID,
                   A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, K.ITEM_SUPPLIER_ORDER_ID,
                  ISNULL(A.PARTY_LEDGER_NAME,'N/A')
                  ,A.COMPANY_NAME,A.DIM_DATE_ID,A.CATEGORY,
A.COST_CENTER,A.XYZ,
ISNULL(A.LEDGERNAME,'N/A')
            ),
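-- CTE3: counts the distinct orders per ITEM_SUPPLIER_ORDER_ID, to separate single-order keys from multi-order keys.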
CTE3 AS (
                  SELECT COUNT (DISTINCT(ORDER_NO))AS COUNT_ORDER_ID,
                  ITEM_SUPPLIER_ORDER_ID
                  FROM CTE2
                  GROUP BY
                  ITEM_SUPPLIER_ORDER_ID
            ),
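-- CTE4: rows whose ITEM_SUPPLIER_ORDER_ID maps to exactly one order.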
CTE4 AS (
                        SELECT A.AMOUNT_SUBSTRING2,
A.AMOUNT_SUBSTRING_AFTER$
,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                  FROM CTE2 A
                  JOIN CTE3 B ON
                   A.ITEM_SUPPLIER_ORDER_ID=B.ITEM_SUPPLIER_ORDER_ID
                  WHERE B.COUNT_ORDER_ID=1
            ),
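-- CTE5: rows whose ITEM_SUPPLIER_ORDER_ID maps to more than one order, restricted to entries made by ADMIN.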
CTE5 AS (
                        SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  --A.DIM_ITEM_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                  FROM CTE2 A
                  JOIN CTE3 B ON
                   A.ITEM_SUPPLIER_ORDER_ID=B.ITEM_SUPPLIER_ORDER_ID
                  WHERE B.COUNT_ORDER_ID>1
                  AND A.ENTERED_BY LIKE '%ADMIN%'
            ),
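-- CTE6: union of the single-order (CTE4) and multi-order (CTE5) result sets.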
CTE6 AS (
                 
                  SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                   FROM CTE4 A
            UNION ALL
                  SELECT a.AMOUNT_SUBSTRING2,
a.AMOUNT_SUBSTRING_AFTER$,A.ORDER_DATE, A.ENTERED_BY,
                  A.ORDER_NO, A.VOUCHER_TYPE_NAME, A.PARTY_LEDGER_NAME ,
                  A.ITEM_SUPPLIER_ORDER_ID,
                  A.DIM_DATE_ID,A.COMPANY_NAME,A.CATEGORY,
A.COST_CENTER,A.XYZ,A.LEDGERNAME
                   FROM CTE5 A
            )
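-- Merge the consolidated rows into ITEM_PURCHASE_ORDER: insert new keys, update the amounts for existing keys.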
     
MERGE ITEM_PURCHASE_ORDER AS TARGET
USING CTE6 AS SOURCE
                  ON(TARGET.ITEM_SUPPLIER_ORDER_ID = SOURCE.ITEM_SUPPLIER_ORDER_ID
                  AND TARGET.DIM_DATE_ID= SOURCE.DIM_DATE_ID
                  AND TARGET.COMPANY_NAME=SOURCE.COMPANY_NAME
                  --AND TARGET.PARTY_LEDGER_NAME=SOURCE.PARTY_LEDGER_NAME
                  --AND TARGET.LEDGERNAME=SOURCE.LEDGERNAME
                  )
                 
WHEN NOT MATCHED BY TARGET
                  THEN INSERT(AMOUNT_SUBSTRING2,AMOUNT_SUBSTRING_AFTER$,
                  --ORDER_DATE,
                  ENTERED_BY,
                  --ORDER_NO,VOUCHER_TYPE_NAME,
                  --PARTY_LEDGER_NAME,
                  ITEM_SUPPLIER_ORDER_ID,
                  DIM_DATE_ID,
                  COMPANY_NAME
                  --,CATEGORY,COST_CENTER,XYZ,LEDGERNAME
                  )
                  VALUES(SOURCE.AMOUNT_SUBSTRING2,SOURCE.AMOUNT_SUBSTRING_AFTER$,
                  --SOURCE.ORDER_DATE     ,
                  SOURCE.ENTERED_BY ,
                  --SOURCE.ORDER_NO ,
                  --SOURCE.   VOUCHER_TYPE_NAME ,
                  --SOURCE.   PARTY_LEDGER_NAME ,
                  SOURCE.     ITEM_SUPPLIER_ORDER_ID  ,
                  SOURCE.     DIM_DATE_ID ,
                  SOURCE.COMPANY_NAME
                  --SOURCE.CATEGORY,
                  --SOURCE.COST_CENTER,
                  --SOURCE.XYZ,
                  --SOURCE.LEDGERNAME
                  )
WHEN MATCHED
                  THEN UPDATE SET TARGET.AMOUNT_SUBSTRING2= SOURCE.AMOUNT_SUBSTRING2
                  ,TARGET.AMOUNT_SUBSTRING_AFTER$=SOURCE.AMOUNT_SUBSTRING_AFTER$;

Thereafter the ETL resource examined the query execution plan to check the joins and the index usage. The resource identified the places where hash joins were being used.
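The join and index usage can be inspected either from the actual execution plan in SSMS or from the I/O and timing statistics. A minimal sketch of the latter, assuming the logic above is wrapped in the stored procedure sp_item_purchase_order named earlier (the SET STATISTICS options are standard SQL Server session settings, not part of the original procedure):

SET STATISTICS IO ON;    -- reports logical/physical reads per table
SET STATISTICS TIME ON;  -- reports CPU and elapsed time per statement

EXEC sp_item_purchase_order;   -- run the procedure under test

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;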


The ETL resource looked for the following:

·         Hash match cost
·         Table scan cost
Within the hash match cost, the ETL resource looked for the following:
·         The join creating the hash match
·         The table size
·         The total costs, including the subtree cost



The same operation shall be repeated for the table scans.

Hereafter the ETL resource identified the join to be worked upon from the given SQL.


Steps to convert the hash join to a nested loop join (see the sketch below):
  •          Take the CTE output and store it in a temporary table, with the columns in the same order as the joins.
  •          Create an index in the same order as the joins.
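A minimal sketch of these two steps, assuming the output of CTE1 from the procedure above is materialized first (the temporary table name #CTE1_PURCHASE and the trimmed column list are illustrative):

-- Step 1: store the CTE output in a temporary table, keeping the join columns
-- in the same order as the join predicates used in CTE2.
WITH CTE1 AS (
      SELECT A.COMPANY_NAME, A.ORDER_NO, A.PARTY_LEDGER_NAME, A.ENTERED_BY, A.ORDER_DATE,
             A.AMOUNT_SUBSTRING2, A.AMOUNT_SUBSTRING_AFTER$, A.CATEGORY, A.COST_CENTER,
             A.VOUCHER_TYPE_NAME, A.XYZ, ISNULL(D.DIM_DATE_ID, 1) AS DIM_DATE_ID, A.LEDGERNAME
      FROM STG_DAYBOOK A
      LEFT OUTER JOIN DIM_DATE D ON A.ORDER_DATE = D.DATE_DESC
      WHERE A.VOUCHER_TYPE_NAME IN (SELECT DISTINCT VOUCHER_TYPE_NAME
                                    FROM STG_VOUCHER_TYPE
                                    WHERE VOUCHER_TYPE IN ('PURCHASE'))
)
SELECT ORDER_NO, VOUCHER_TYPE_NAME, DIM_DATE_ID, COMPANY_NAME, PARTY_LEDGER_NAME,
       LEDGERNAME, COST_CENTER, AMOUNT_SUBSTRING2, AMOUNT_SUBSTRING_AFTER$,
       ENTERED_BY, ORDER_DATE, CATEGORY, XYZ
INTO  #CTE1_PURCHASE
FROM  CTE1;

-- Step 2: create an index whose key columns follow the same order as the joins.
CREATE CLUSTERED INDEX IX_CTE1_PURCHASE
ON #CTE1_PURCHASE (ORDER_NO, VOUCHER_TYPE_NAME, DIM_DATE_ID, COMPANY_NAME,
                   PARTY_LEDGER_NAME, LEDGERNAME, COST_CENTER);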




Next, develop the temporary table for the second table, along with the indexes, in the same order in which the joins have been done (sketched below).
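A sketch for the second table in the join, under the same assumption (the temporary table name is illustrative; the column order mirrors the join predicates):

-- Materialize the second table of the join and index it on the same columns, in the same order.
SELECT ITEM_SUPPLIER_ORDER_ID, ORDER_NO, VOUCHER_TYPE_NAME, DIM_DATE_ID,
       COMPANY_NAME, PARTY_LEDGER_NAME, LEDGERNAME, COST_CENTER
INTO  #ITEM_SUPPLIER_ORDER
FROM  ITEM_SUPPLIER_ORDER;

CREATE CLUSTERED INDEX IX_TMP_ITEM_SUPPLIER_ORDER
ON #ITEM_SUPPLIER_ORDER (ORDER_NO, VOUCHER_TYPE_NAME, DIM_DATE_ID, COMPANY_NAME,
                         PARTY_LEDGER_NAME, LEDGERNAME, COST_CENTER);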





Hereafter the ETL developer has two options (compared in the sketch below):
  •          Use the join conditions
  •          Move the join conditions to the WHERE clause
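A minimal sketch of the two options, using the temporary tables assumed above (only two of the join columns are shown for brevity). For an inner join, SQL Server treats both forms the same way, so the resulting plan is normally identical:

-- Option 1: predicates expressed as join conditions
SELECT A.ORDER_NO, K.ITEM_SUPPLIER_ORDER_ID
FROM  #CTE1_PURCHASE A
JOIN  #ITEM_SUPPLIER_ORDER K
      ON  A.ORDER_NO          = K.ORDER_NO
      AND A.VOUCHER_TYPE_NAME = K.VOUCHER_TYPE_NAME;

-- Option 2: the same predicates moved to the WHERE clause
SELECT A.ORDER_NO, K.ITEM_SUPPLIER_ORDER_ID
FROM  #CTE1_PURCHASE A, #ITEM_SUPPLIER_ORDER K
WHERE A.ORDER_NO          = K.ORDER_NO
  AND A.VOUCHER_TYPE_NAME = K.VOUCHER_TYPE_NAME;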




Understanding the impact of the WHERE clause.




Using the join conditions instead of the WHERE clause: in our case, whether the predicates are written as join conditions or in the WHERE clause, the optimization process remains the same.





Summary of the findings:
·         In converting the hash joins to nested loops, the maximum memory is used by the following:
o   The sort operation
o   The index seek (indexed nested loop scan)
o   The table scan
The cost reduction comes from the nested loop join being used instead of the hash join.
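Once both temporary tables are clustered on the join columns, the optimizer can read them in key order and use a merge join without an extra sort. The hint below is only a way to force the merge join for comparison; the temporary table names are the illustrative ones from the sketches above, and the aggregation from CTE2 is omitted for brevity:

SELECT A.ORDER_NO, A.AMOUNT_SUBSTRING2, A.AMOUNT_SUBSTRING_AFTER$,
       K.ITEM_SUPPLIER_ORDER_ID
FROM  #CTE1_PURCHASE A
LEFT OUTER MERGE JOIN #ITEM_SUPPLIER_ORDER K   -- join hint: force a merge join
      ON  A.ORDER_NO          = K.ORDER_NO
      AND A.VOUCHER_TYPE_NAME = K.VOUCHER_TYPE_NAME
      AND A.DIM_DATE_ID       = K.DIM_DATE_ID
      AND A.COMPANY_NAME      = K.COMPANY_NAME
      AND A.PARTY_LEDGER_NAME = K.PARTY_LEDGER_NAME
      AND A.LEDGERNAME        = K.LEDGERNAME
      AND A.COST_CENTER       = K.COST_CENTER;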



For the next steps regarding the join optimization, please contact the following:
Name: Apoorv Chaturvedi
Email: apoorv@mnnbi.com
Phone: +91-8802466356
Website: www.mnnbi.com
Prepared by Ankur Gupta(Intern): https://www.linkedin.com/in/ankur-gupta-3b9a71179

Wednesday, 26 December 2018

Testing methodology for Turbodata using SSIS


Testing experience-summary

This blog describes the comprehensive testing methodology adopted by the Turbodata team for GST and audit reporting purposes.






1.       Business Requirements and Understanding: the resource understood the business logic requirements for the GST reports. Each measure and value was specified with the end client (GSP partner).
2.       Test planning and estimation: for the same, the ETL team did the following steps:
a.       Arranged for sample Tally data from the end client
b.      Looked at the final GST reports to be developed at the ETL end. The numbers from the source system GST reports were matched with the data warehouse GST reports.
3.       Designing test cases and preparing the test plan: for the number-auditing purposes, the following was the methodology adopted by the ETL team:
a.       The testing was done for the entire fiscal year and fiscal month (the granularity of GST reporting is the fiscal year and fiscal month)
b.      The testing was done across multiple companies
c.       The scenarios were developed for GST reports based on ledger entries and voucher type entries. For example, how should credit notes be handled, and how should sales invoices be handled?
d.      The specifications were signed off by the end client (GSP partner)
4.       Test execution with bug closure and reporting:
·         The ETL process was run across various fiscal years and fiscal months. For each fiscal year and fiscal month, the variances with regard to the source ERP reporting module were reported to the developers
·         Finding the bug: the business logic because of which the error was happening was deduced, and the details were given to the developer.
5.       Summary report and result analysis: the final details were given to the ETL team based on the data audit numbers.


The types of testing that were done were as follows:
·         Incremental testing: when the jobs were migrated from development to production, the output of the incremental data was checked. For the same, the ETL team used the following:
o   Merge joins and lookup transforms; the non-matching entries from the lookup were stored in separate tables

·         Data quality testing: the dirty data was checked during the data load. For example, between stg_daybook_temp and stg_daybook the dirty data was cleansed. The log of the same has been kept for audit purposes. For matching the same, the ETL team used the following:
o   Fuzzy lookup
·         Data transformation testing: The business rules were validated with the end client. Thereafter the business logic was tested using SQL and SSIS coding. The features in SSIS used for the same were as follows:
o   Redirect rows on failure
o   Used lookup for finding the unmatched rows
·         Data accuracy testing: the output numbers in the reports were matched with the Tally GST reports to check the accuracy of the output data.
·         Data completeness testing: for the same, the ETL team did the following (a SQL sketch of these checks appears after this list):
o   Used the @execute variable to check the row counts
o   Used the following at the dataflow level:
§  Redirect rows for transformations and target table loading: the unmatched rows were stored in staging tables for audit purposes
·         At the sequence container level, the ETL team used the following settings:
o   Control flow:
§  Transaction level: Required
§  Isolation level: Serializable (where possible), to enable rollback of transactions. In the case of a nightly load run after the day's processing is completed, the ETL team can use ReadUncommitted
§  Enable logging
§  Enable the usage of the @execute variable
§  Usage of checkpoints and breakpoints for error handling

o   Dataflow:
§  Transaction level: Supported
§  Isolation level: ReadCommitted (where possible)
§  Enable logging (use parent settings)
§  Enable the usage of the @execute variable
§  Error logging: use parent settings
§  Fail parent on failure: true
·         Metadata testing: for the same, the ETL team used the following:
o   Usage of log tables
o   Usage of the data viewer during the testing process

For metadata testing, Data Profiling transform can be used.

·         Application upgrades: this entailed migration across multiple end clients. For the same, the ETL team tested with the following:
o   Multiple package configurations
o   Multiple project deployment configurations: using manifest files
·         GUI/navigation testing: the final reports were tested for report reload and report refresh times
·         Source to target testing (validation testing): for the same, the filter parameters were checked during extraction to the staging area. The following were the methods adopted:
o   The extractor C# code was checked for the required data columns
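A minimal SQL sketch of the row-count and key-completeness checks referred to above, assuming the staging tables named in this post (the audit table AUDIT_ROW_COUNTS is a hypothetical name used only for illustration, and ORDER_NO stands in for whatever key fields are under test):

-- Row counts of the source staging table versus the cleansed target table.
INSERT INTO AUDIT_ROW_COUNTS (TABLE_NAME, SOURCE_ROWS, TARGET_ROWS, LOAD_TS)  -- hypothetical audit table
SELECT 'STG_DAYBOOK',
       (SELECT COUNT(*) FROM STG_DAYBOOK_TEMP),
       (SELECT COUNT(*) FROM STG_DAYBOOK),
       GETDATE();

-- Keys present in the source but missing in the target (the "SQL NOT IN" style check).
SELECT T.ORDER_NO
FROM   STG_DAYBOOK_TEMP T
WHERE  T.ORDER_NO NOT IN (SELECT S.ORDER_NO
                          FROM   STG_DAYBOOK S
                          WHERE  S.ORDER_NO IS NOT NULL);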


Methodology adopted for testing: the source-to-target mappings given by the end clients were taken as a base. Thereafter the testing process was started.

Test case scenarios:

1.  Mapping doc validation: Verify whether the business logic was mentioned
    M&N BI test case: Every measure and dimension was tested for source mapping

2.  Mapping doc validation: Validation
    M&N BI test case:
        1.)      Null check: derived column (FINDSTRING)
        2.)      Source and target data types to be the same: data conversion
        3.)      Validate the names of the columns in the table against the mapping doc: manually checked

3.  Mapping doc validation: Constraints
    M&N BI test case: All the target tables had primary key constraints. In the intermediary tables, the composite key constraints were checked using ROW_NUMBER()

4.  Mapping doc validation: Data consistency issues
    M&N BI test case:
        1.)      Stg_daybook_temp to stg_daybook load: used fuzzy lookup, REPLACE and FINDSTRING for the same

5.  Mapping doc validation: Completeness issues
    M&N BI test case:
        1.)      Confirm all the data is loaded: @value variable, log tables
        2.)      Check for any rejected records: dataflow level (audit tables)
        3.)      Check that data is not truncated: warnings in SSIS, output varchar(255)
        4.)      Boundary value analysis: WIP
        5.)      Compare unique values of key fields between source and target: audit table (SQL NOT IN statement used)

6.  Mapping doc validation: Correctness issues
    M&N BI test case:
        1.)      Data that is misspelled: manual checking, lookup
        2.)      Null, unique, not null: referential key constraints were checked

7.  Mapping doc validation: Data quality
    M&N BI test case:
        1.)      Null check: handled above (conditional split)
        2.)      Date check: usage of parameters, table variables
        Use the Data Profiling task

8.  Mapping doc validation: Duplicate check
    M&N BI test case:
        1.)      Check for unique key and primary key constraints in all tables
        2.)      Sort transform: remove duplicates

9.  Mapping doc validation: Date validation
    M&N BI test case:
        1.)      Know the row creation date: insert_ts, update_ts and load_ts to be added. Used table variables and parameters for date entry

10. Mapping doc validation: Data validation
    M&N BI test case:
        1.)      To validate the complete data set in the source and target tables, a MINUS-style query is the best solution: staging tables, audit tables (see the EXCEPT sketch below the table)
        2.)      Difference between source and target: redirect the rows

11. Mapping doc validation: Data cleanliness
    M&N BI test case:
        1.)      Remove unnecessary columns: optimize the dataflow
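SQL Server has no MINUS operator; EXCEPT plays the same role. A minimal sketch of the source-minus-target comparison referred to in item 10 (the column list is illustrative and should be replaced by the full set of columns under test):

-- Rows present in the staging source but not in the cleansed target.
SELECT ORDER_NO, COMPANY_NAME, AMOUNT
FROM   STG_DAYBOOK_TEMP
EXCEPT
SELECT ORDER_NO, COMPANY_NAME, AMOUNT
FROM   STG_DAYBOOK;

-- And the reverse direction: rows present in the target but not in the source.
SELECT ORDER_NO, COMPANY_NAME, AMOUNT
FROM   STG_DAYBOOK
EXCEPT
SELECT ORDER_NO, COMPANY_NAME, AMOUNT
FROM   STG_DAYBOOK_TEMP;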




Types of ETL bugs taken into account:

1.  Type of ETL bug: User interface bug
    Details: Extractor for Tally and SAP: interface bug testing for the 'from' and 'to' dates

2.  Type of ETL bug: Boundary value analysis
    Details: The maximum and minimum dates in stg_daybook_temp and stg_daybook should lie between the input dates given to the ETL extractor. Use the Data Profiling task (a SQL sketch of this check follows the table)

3.  Type of ETL bug: Equivalence class partitioning (the data type should be specific)
    Details: The HSN code should be 16 characters; if fewer or more than 16 characters, an error should be generated. Use the Data Profiling task

4.  Type of ETL bug: Input/output bugs
    Details: Related to the above

5.  Type of ETL bug: Calculation bugs
    Details: Check manually

6.  Type of ETL bug: Load condition bugs
    Details: Do not allow multiple users; the configuration file is to be checked

7.  Type of ETL bug: Race condition bugs
    Details: Test the system at full load; check the incremental data load
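A minimal SQL sketch of the boundary value check in item 2, assuming the extractor's from/to dates are available as parameters (@FromDate and @ToDate are illustrative names, and ORDER_DATE is cast in case it is stored as text):

DECLARE @FromDate DATE = '2018-04-01',   -- illustrative extractor 'from' date
        @ToDate   DATE = '2019-03-31';   -- illustrative extractor 'to' date

-- Returns a row only when the loaded dates fall outside the extractor's window.
SELECT MIN(CAST(ORDER_DATE AS DATE)) AS MIN_LOADED_DATE,
       MAX(CAST(ORDER_DATE AS DATE)) AS MAX_LOADED_DATE
FROM   STG_DAYBOOK
HAVING MIN(CAST(ORDER_DATE AS DATE)) < @FromDate
    OR MAX(CAST(ORDER_DATE AS DATE)) > @ToDate;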




Prepared by:

Ishwar Singh
Lead ETL Developer
Linkedin profile: https://www.linkedin.com/in/ishwar-singh-034324139/

Website: www.mnnbi.com
Email address: apoorv@mnnbi.com
Phone: +91-8802466356
