Market Basket Analysis is a modelling technique based upon the idea that if a customer buys "ITEM A", then there is a certain probability that they will also buy "ITEM B".
Purpose of Market Basket Analysis through Turbodata
The purpose of market basket analysis is to determine which products customers purchase together, i.e. if a customer purchases milk, there is a likelihood that the customer will also buy sugar. MBA (Market Basket Analysis) is used to analyse customer behaviour.
Why Market Basket Analysis in Turbodata?
· Data has been cleansed and consolidated in Turbodata.
· Maximum data redundancies have been removed and joins have been minimized. All the business logic from the ERPs has been pre-coded. This process helps us in the following ways:
o The reporting team can use parameterized queries to get the required outputs. There is no need to store all the data in pre-aggregated form. This is important since a large number of item combinations are possible.
o Turbodata's market basket analysis solution offers analysis at the reporting layer without the use of expensive reporting tools.
o The solution is available in both Tableau and Power BI.
o Multiple users can access the report. Turbodata offers analysis at any point in time, across any number of locations or any combination of items.
o Facility for large-scale deployment of the code because of the usage of table variables and functions.
STEPS:
STEP 1: A master procedure was created with two input parameters along with the date.
STEP 2: Five variables were created for the calculation of confidence, lift and support (see the sketch after the variable list below).
· @n_total: total number of orders
· @n_dual: number of invoices having more than one item
· @n_item1: number of invoices having ITEM1
· @n_item2: number of invoices having ITEM2
· @n_item1_item2: number of invoices having both ITEM1 and ITEM2.
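Below is a minimal sketch of what such a master procedure and its variables could look like. The procedure name, parameter names and data types are illustrative assumptions, not the actual Turbodata code.

-- Hedged sketch of the master procedure (illustrative names, not the actual Turbodata objects).
CREATE PROCEDURE dbo.usp_market_basket_master
    @item_1 VARCHAR(100),
    @item_2 VARCHAR(100),
    @date   DATE
AS
BEGIN
    DECLARE @n_total       INT; -- total number of orders
    DECLARE @n_dual        INT; -- invoices having more than one item
    DECLARE @n_item1       INT; -- invoices having ITEM1
    DECLARE @n_item2       INT; -- invoices having ITEM2
    DECLARE @n_item1_item2 INT; -- invoices having both ITEM1 and ITEM2
    -- The counts are populated by the functions described in the following steps.
END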
STEP 3: Calculate the Confidence, Support and Lift values.
For this we created functions instead of procedures because:
· A function is compiled and executed every time it is called; it works just like a formula.
· We can add parameters according to the requirement.
· Functions are suitable when we need only input parameters.
There are two types of functions we use in our code:
1. Scalar-valued function
2. Table-valued function
Scalar-valued function
A scalar-valued function is used when we have to return a single value (integer, decimal or any other type). It might or might not be related to tables in our database.
Table-valued function
A table-valued function returns the specified columns for the rows in our table meeting the given criteria.
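As an illustration of the difference (with hypothetical object names, not the actual Turbodata functions), a scalar-valued function returns a single value while a table-valued function returns a rowset:

-- Scalar-valued function (illustrative): returns one integer.
CREATE FUNCTION dbo.fn_order_count (@date DATE)
RETURNS INT
AS
BEGIN
    DECLARE @cnt INT;
    SELECT @cnt = COUNT(DISTINCT iso.order_no)
    FROM item_supplier_order iso
    JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id -- join column assumed
    WHERE dd.date <= @date;
    RETURN @cnt;
END
GO
-- Table-valued function (illustrative): returns the orders containing a given item.
CREATE FUNCTION dbo.fn_orders_with_item (@item VARCHAR(100), @date DATE)
RETURNS TABLE
AS
RETURN
(
    SELECT iso.order_no, di.item
    FROM item_supplier_order iso
    JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
    JOIN dim_item di ON di.dim_item_id = isa.dim_item_id
    JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
    WHERE di.item = @item AND dd.date <= @date
);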
So for the above process we use the following solutions:
Calculation of @n_total
@n_total is the total number of orders. The total number of orders comes from item_supplier_order at a particular date, so we join the item_supplier_order table with dim_date and take only the dates that are less than or equal to @date.
Here is the snapshot for the total orders.
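A hedged sketch of the @n_total logic is given below; the dim_date_id join column and the date column in dim_date are assumptions based on the description above.

-- Illustrative calculation of @n_total: orders on or before @date.
SELECT @n_total = COUNT(DISTINCT iso.order_no)
FROM item_supplier_order iso
JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
WHERE dd.date <= @date;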
Calculation for @n_dual
This is the function we created for invoices having more than one item. Here we need partyledgername from item_supplier_order, the item from item_sales_order and the date from dim_date, so we join item_supplier_order with item_sales_order on the basis of the composite key, and the result is further joined with dim_date.
For more than one item we take the invoices whose count of dim_item_id is greater than one.
Attached is the snapshot from Turbodata:
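A hedged sketch of the @n_dual logic (invoices containing more than one distinct item), using the same assumed join keys:

-- Illustrative calculation of @n_dual: invoices with more than one distinct item.
SELECT @n_dual = COUNT(*)
FROM (
    SELECT iso.order_no
    FROM item_supplier_order iso
    JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
    JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
    WHERE dd.date <= @date
    GROUP BY iso.order_no
    HAVING COUNT(DISTINCT isa.dim_item_id) > 1
) AS multi_item_orders;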
Calculation for @n_item1
This is the function we created for ITEM1. We took order_no from item_supplier_order, dim_item_id from item_sales_order and the item from dim_item, with the condition that the count of dim_item_id is greater than one.
Calculation for @n_item2 (analogous to @n_item1, using ITEM2)
Calculation for both the items: @n_item1_item2
This is the function created for invoices containing both ITEM1 and ITEM2.
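Hedged sketches of the item-level counts are shown below; @n_item2 mirrors @n_item1 with @item_2, and @n_item1_item2 requires both items in the same order (object and column names as assumed above).

-- Illustrative calculation of @n_item1: invoices containing ITEM1 (@n_item2 is the same query with @item_2).
SELECT @n_item1 = COUNT(DISTINCT iso.order_no)
FROM item_supplier_order iso
JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
JOIN dim_item di ON di.dim_item_id = isa.dim_item_id
JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
WHERE di.item = @item_1 AND dd.date <= @date;

-- Illustrative calculation of @n_item1_item2: invoices containing both ITEM1 and ITEM2.
SELECT @n_item1_item2 = COUNT(*)
FROM (
    SELECT iso.order_no
    FROM item_supplier_order iso
    JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
    JOIN dim_item di ON di.dim_item_id = isa.dim_item_id
    JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
    WHERE di.item IN (@item_1, @item_2) AND dd.date <= @date
    GROUP BY iso.order_no
    HAVING COUNT(DISTINCT di.item) = 2
) AS both_items;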
Confidence is defined as the frequency of both items together divided by the frequency of ITEM1, i.e.
Confidence = @n_item1_item2 / @n_item1
Support is defined as the occurrence of both items together divided by the total number of orders, i.e.
Support = @n_item1_item2 / @n_total
Lift is defined as the support of both items together divided by the product of support A and support B, i.e.
Lift = @support / (@support_x * @support_y)
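Putting the counts together, a minimal sketch of the final measures follows; the NULLIF guards against division by zero are our own addition.

-- Illustrative calculation of support, confidence and lift from the counts above.
DECLARE @support_x  DECIMAL(18,6) = CAST(@n_item1 AS DECIMAL(18,6)) / NULLIF(@n_total, 0);
DECLARE @support_y  DECIMAL(18,6) = CAST(@n_item2 AS DECIMAL(18,6)) / NULLIF(@n_total, 0);
DECLARE @support    DECIMAL(18,6) = CAST(@n_item1_item2 AS DECIMAL(18,6)) / NULLIF(@n_total, 0);
DECLARE @confidence DECIMAL(18,6) = CAST(@n_item1_item2 AS DECIMAL(18,6)) / NULLIF(@n_item1, 0);
DECLARE @lift       DECIMAL(18,6) = @support / NULLIF(@support_x * @support_y, 0);
SELECT @support AS support, @confidence AS confidence, @lift AS lift;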
How the Market Basket Analysis code has been developed over the compressed database.
How Turbodata helped lower the costs of developing a datawarehouse and helped end clients do predictive analytics quickly and easily, applicable for retail sales and inventory (Website: www.mnnbi.com)
Purpose of the development of the product: The Turbodata team intends to reduce the costs of analytic solutions by creating a single platform for ETL, reporting, report development and predictive analytics. The team also intends to provide best-in-class analytics on the same machine on which the ERP is running, or with minimal additional hardware requirements for the end client. This has been done to develop scalable systems that can be deployed over a large number of customers (with limited budgets) with ease (deployment, delivery and usage) and convenience (maintenance).
The end goal is to increase derisking and predictability for the end clients at lower costs.
Methodology for achieving the required ends for the end client:
· Turbodata adopted the Inmon methodology for datawarehouse development so that multiple data sources could be added onto the same datawarehouse. That is, the change from one data source to another could be done with ease. More details are on the attached web page link: http://mnnbi.com/mnnbi_tallydataconsolidation.html
o The benefits of the normalization of data were as follows:
§ The incremental data load took minimum time and had minimum impact on the source system. The ETL team was able to commit the incremental data load within a maximum of 2 GB of RAM from multiple source systems. The source systems did not hang while the incremental data load was running.
§ Error handling was done with ease at the staging layer.
§ Massive data compression took place due to reduced data redundancies.
§ The business logic was coded between the staging and ODS layers, thereby reducing the length of the final SQL code.
The attached video shows a more detailed description of the benefits listed above:
The joins were reduced in the data mart layer (over which a reporting layer was built).
The ETL team was able to develop extremely complex reports using the datawarehouse, as in the attached sample:
Due to the data compression, for most projects the ETL team is able to bring the data within 1 GB. Hence the desktop version of Microsoft Power BI could be used free of cost for the end client.
Reducing the cost of predictive analytics solutions
Most end clients use high-end predictive tools over the datawarehouse or over direct data extracts from various source databases. With large datasets, predictive analytics using in-memory solutions entails high usage of RAM. The ETL team has worked around this issue in the following manner:
o A seamless environment was created for ETL, reporting and thereafter predictive analytics on SQL/C# and .NET. The reasons for the same are attached herewith:
§ Maintenance becomes easier since there is a single platform for all aspects.
§ The cost comes down since the resources used for ETL can also be used for predictive analytics.
§ Error handling becomes very easy since errors can be captured earlier in the data flow.
Hypothesis testing
Based on the hypothesis testing, the ETL team developed ARIMA analysis and Market Basket Analysis in SQL using a seamlessly integrated set of stored procedures. That is, the ARIMA analysis flowed from the datawarehouse's A, B, C categorization. The ETL team thus reduced the requirement for high-end R and Python developers to code over the datawarehouse, thereby presenting a seamless solution to the end client on an 8 GB RAM machine.
Benefits to the end client:
· The end client gets immediate and confirmed peace of mind and satisfaction through immediate deployment of predictive and forecasting analytics modules.
· No additional hardware/software needs to be procured.
· The costs are way lower for the end client.
· Large scale deployment is possible with the given set of solutions.
Please check the attached video for the same:
A more detailed video is attached herewith:
The Power BI code has been developed to enable the following:
· Run the market basket analysis code based on two input item parameters across any filter date and location ranges.
· Remove the dependence on SQL Server for the implementation of Market Basket Analysis.
· Achieve faster execution times, since the measures and columns are calculated on a columnar basis instead of a row-by-row basis.
Attached is the methodology of Market Basket Analysis using Power BI over Turbodata.
Step 2: Three parameters are declared, i.e. item_1, item_2 and date_1. This has been done through the Power Query Editor.
For creating the parameters:
· Open the Power Query Editor
· Go to Manage Parameters
· Create a parameter
Set the date_1 parameter in item_supplier_order.
Step 3: We have to calculate the support, confidence and lift values.
Calculation of support:
Support = f(X,Y)/total
Where f(X,Y) is the number of invoices in which both "ITEM A" and "ITEM B" occur for the given time period, and total is the count of the total sales order nos. for the given time period.
Attached is the snapshot of the DAX function for f(X,Y):
Snapshot for total:
Snapshot for the calculation of support:
Snapshot for the calculation of item A only:
Calculation for the confidence:
Confidence = F(X,Y)/F(X)
Snapshot for the calculation of confidence:
Where P_X_Y is the count of orders containing both items and P_X is the count of orders containing item A.
Snapshot for the calculation of P_X:
Calculation for lift:
Lift = support_1/(support_x*support_y)
Where support_x = f(X)/total and support_y = f(Y)/total.
Snapshot for the calculation of the lift value:
Snapshot for the calculation of support_x:
Snapshot for the calculation of support_y:
Attached is the summary of the final output for the given set of input items (CPU and ALMIRAH).
Market Basket Analysis through Tableau
Market Basket Analysis has also been done through Tableau.
Four data sources have been taken for the calculation of market basket analysis:
1. CPU
2. ALMIRAH
3. MARKET_BASKET
4. ORDER_MASTER
We have to calculate the lift, confidence and support values in Market Basket Analysis, so we have taken the two items 'CPU' and 'ALMIRAH'.
Create three parameters:
1. Date_1
2. Item_1
3. Item_2
We take the order nos. where CPU has been purchased. So we join item_supplier_order with item_sales_order on the basis of item_supplier_order_id, dim_date on dim_date_id and dim_item on dim_item_id.
We pass the parameters through Custom SQL.
Attached is the snapshot from Tableau for the date parameter:
Similarly, attached is the snapshot from Custom SQL to pass the item_1 (CPU) parameter:
Attached is the snapshot from Custom SQL to pass the item_2 (ALMIRAH) parameter:
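A hedged sketch of the kind of Custom SQL behind such a data source is given below; <Parameters.Date_1> and <Parameters.Item_1> are Tableau parameter placeholders, and the join columns follow the assumed schema described above.

-- Illustrative Tableau Custom SQL for the item_1 (CPU) data source.
SELECT DISTINCT iso.order_no
FROM item_supplier_order iso
JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
JOIN dim_item di ON di.dim_item_id = isa.dim_item_id
JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
WHERE dd.date <= <Parameters.Date_1>
  AND di.item = <Parameters.Item_1>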
MARKET_BASKET is the data source where we calculate the order nos. where both item_1 and item_2 have been purchased, i.e. "ALMIRAH" and "CPU".
This data source is also used to calculate the total number of invoices (order_no).
For the calculation of the order_no count of item_1 we have taken the distinct count of item_1.
For the calculation of the order_no count of item_2 we have taken the distinct count of item_2.
Attached is the snapshot for the calculation of the order_no for both item_1 and item_2.
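For the MARKET_BASKET data source, a hedged Custom SQL sketch that returns the order nos. containing both items (same assumed joins and parameter placeholders):

-- Illustrative Custom SQL for orders containing both item_1 and item_2.
SELECT iso.order_no
FROM item_supplier_order iso
JOIN item_sales_order isa ON isa.item_supplier_order_id = iso.item_supplier_order_id
JOIN dim_item di ON di.dim_item_id = isa.dim_item_id
JOIN dim_date dd ON dd.dim_date_id = iso.dim_date_id
WHERE dd.date <= <Parameters.Date_1>
  AND di.item IN (<Parameters.Item_1>, <Parameters.Item_2>)
GROUP BY iso.order_no
HAVING COUNT(DISTINCT di.item) = 2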
Attached is the calculation for the total order_no.
Based on the above calculations we calculate the confidence, lift and support values.
Attached is the snapshot for the calculation of confidence:
Confidence is the likelihood that if a customer purchases item "A", they will also buy item "B".
Attached is the snapshot for the calculation of support:
The support value indicates how strong the base of the combination is.
Attached is the snapshot for the calculation of lift:
Attached is the snapshot of the resultant:
We took the two items "CPU" and "ALMIRAH", which do not form a natural combo, so their lift value is less than 1.
Note:
The lower the value of lift, the lower the strength of the combo; the higher the value of lift, the higher the strength of the combo that customers like.
ARIMA model in R
Data is made up of noise and signal.
Over a short time period the proportion of noise in the data is greater than the signal. For example, if we analyze the data on an hourly basis, the proportion of noise might be 99.5% and that of signal 0.5%.
Over a long time period, the proportion of noise reduces and the proportion of signal increases. For example, if we analyze the data on a monthly basis, the proportion of noise could be 50% and that of signal 50%.
The purpose of time series forecasting is to segregate the noise from the signal for better management decision making.
ARIMA stands for:
AR: Auto Regressive (p)
Y(t) = c + alpha*y(t-1): captures the time dependency.
I: Integration (d)
Y(t) = c + alpha*y(t-1) + beta*y(t-2) + error: understand the behaviour of the error term. Check for stationarity (check whether the error is hiding signal patterns).
MA: Moving Average (q)
Y(t) = c + alpha*e(t) + beta*e(t-1): MA indicates the dependence of y(t) on the errors in the previous time periods.
Components of the ARIMA model:
· Trend: A long-term increase or decrease in the data is referred to as a trend. It is not necessarily linear. It is the underlying pattern in the data over time.
· Seasonal: When a series is influenced by seasonal factors, i.e. the quarter of the year, the month or the days of the week, seasonality exists in the series. It is always of a fixed and known period, e.g. a sudden rise in sales during Christmas.
· Cyclicity: When data exhibit rises and falls that are not of a fixed period, we call it a cyclic pattern. The duration of these fluctuations is usually at least 2 years.
Assumptions of the ARIMA model:
· Data should be stationary: the error terms should not have time dependency.
· Data should be univariate: the regression model is based on only one variable.
Sample code:
# Read the sample data and convert it to a monthly time series starting January 2003
data = read.csv('http://ucanalytics.com/blogs/wp-content/uploads/2015/06/sample_data.csv')
data = ts(data[,2], start = c(2003,1), frequency = 12)
plot(data, xlab = 'Years', ylab = 'Sample Sales')
The above code is in R. It reads the data, which is in CSV format. 'ts' is the time series function; it converts the data into a monthly series starting from the first month of 2003, and plot() then plots the data for every month. The x-axis shows the years and the y-axis shows the sample sales.
plot(diff(data), ylab = 'Differenced Sample Sales')
This plots the difference y(t) - y(t-1).
plot(log10(data), ylab = 'Log (Sample Sales)')
This graph shows the log values of the given sample sales.
plot(diff(log10(data)), ylab = 'Differenced Log (Sample Sales)')
This graph shows the differenced log of the sample sales.
ACF: ACF stands for the auto-correlation function. It describes how well the present value of the series is related to its past values. A time series can have components like trend, seasonality, cyclicity and residuals. ACF considers all these components while finding correlations, hence it is a 'complete auto-correlation plot'.
PACF: PACF stands for the partial auto-correlation function. Instead of finding correlations of the present value with its lags like ACF, it finds the correlation of the residuals with the next lag. So if there is any hidden information in the residual which can be modeled by the next lag, we might get a good correlation, and we will keep that next lag as a feature while modeling.
Attached is the code:
# ACF and PACF of the differenced log series, plotted side by side
par(mfrow = c(1,2))
acf(ts(diff(log10(data))), main = 'ACF Sample Sales')
pacf(ts(diff(log10(data))), main = 'PACF Sample Sales')
Epilogue
The ETL team has been inspired by the following management books:
· ‘Profit Beyond Measure’ by Thomas Johnson and Anders Brohms
· McKinsey Mind by Ethan Rasiel and Paul Friga.
· Blue Ocean Strategy by W. Chan Kim and Renee Mauborgne
· Better India, Better World by Narayana Murthy
In case of any requirements please contact the following:
Name: Apoorv Chaturvedi
Email: support@mndatasolutions.com;mndatasolutionsindia@gmail.com;support@turbodatatool.com
Phone: +91-8802466356