VIVEK VIDHYA
7 min readJan 17, 2021

PROJECT 4: Retail Analysis with Walmart Data

DESCRIPTION

One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. The business is facing a challenge due to unforeseen demands and runs out of stock some times, due to the inappropriate machine learning algorithm. An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of all, which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Dataset Description

This is the historical data that covers sales from 2010–02–05 to 2012–11–01, in the file Walmart_Store_sales. Within this file you will find the following fields:

  • Store — the store number
  • Date — the week of sales
  • Weekly_Sales — sales for the given store
  • Holiday_Flag — whether the week is a special holiday week 1 — Holiday week 0 — Non-holiday week
  • Temperature — Temperature on the day of sale
  • Fuel_Price — Cost of fuel in the region
  • CPI — Prevailing consumer price index
  • Unemployment — Prevailing unemployment rate

Holiday Events

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Analysis Tasks

Basic Statistics tasks

  1. Which store has maximum sales
  2. Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation
  3. Which store/s has good quarterly growth rate in Q3’2012
  4. Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together
  5. Provide a monthly and semester view of sales in units and give insights

Statistical Model

For Store 1 — Build prediction models to forecast demand

  1. Linear Regression — Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.
  2. Change dates into days by creating new variable.

>Importing packages and CSV data

>EDA(Exploratory Data Analysis | Data Audit)

> project.shape

O/P

(6435, 8)

>project.columns

O/P

Index([‘Store’, ‘Date’, ‘Weekly_Sales’, ‘Holiday_Flag’, ‘Temperature’,

‘Fuel_Price’, ‘CPI’, ‘Unemployment’],

dtype=’object’)

>project.head(5)

O/P

>project.info()

o/p

>project.describe().T

o/p

>Removing the Date column inorder to apply function -to get detailed info:

>Creating custom summary using function:

def var_summary(x):

return pd.Series([x.count(), x.isnull().sum(), x.sum(), x.mean(), x.median(), x.std(), x.var(), x.min(),x.quantile(0.01),x.quantile(0.05),x.quantile(0.10),x.quantile(0.25),x.quantile(0.50),x.quantile(0.75), x.quantile(0.90),x.quantile(0.95), x.quantile(0.99),x.max()],

index=[’N’, ‘NMISS’, ‘SUM’, ‘MEAN’,’MEDIAN’, ‘STD’, ‘VAR’, ‘MIN’, ‘P1’ , ‘P5’ ,’P10' ,’P25' ,’P50' ,’P75' ,’P90' ,’P95' ,’P99' ,’MAX’])

>project1.apply(lambda x:var_summary(x)).T

>Handling Missings values in weekly sales

> Fill with mean

project1[‘Weekly_Sales’]=project1[‘Weekly_Sales’].fillna(project1[‘Weekly_Sales’].mean())

project1

Basic Statistics tasks

1.Which store has maximum sales

>Store 20 has max number of amount 301397792.46000004

2. Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation

>coefficient of mean to standard deviation

>Store 14 has maximum standard deviation of 317569.9494.

3.Which store/s has good quarterly growth rate in Q3’2012

>growthrateQ3=dtq3.groupby(“Store”).Weekly_Sales.sum()

>Store 4 has good quarterly growth rate Q3 2012 =25652119.35

4.Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

Output:

>Therefore Thanksgiving days has more sale as compared to non-holiday.

5.Provide a monthly and semester view of sales in units and give insights

>MONTHLY:

>We can observe from the Monthly Sales Graph that highest sum of sales is recorded in between jan-2011 to march-2011.

>semester:

>We can Observe from Semester Sales graph that at beginning of 1st sem of 2010 and 1st sem of 2013 sales are lowest .

RESULT OF Basic Statistics tasks:

  1. Store 20 has max number of amount 301397792.46000004
  2. Store 14 has maximum standard deviation of 317569.9494.
  3. Store 4 has good quarterly growth rate Q3 2012 =25652119.35
  4. Thanksgiving days has more sale as compared to non-holiday.
  5. >>We can observe from the Monthly Sales Graph that highest sum of sales is recorded in between jan-2011 to march-2011.

>>We can Observe from Semester Sales graph that at beginning of 1st sem of 2010 and 1st sem of 2013 sales are lowest .

Statistical Model

1.Linear Regression — Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.

>Exploring data

# Distribution of variables

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

> Categorizing dependent variable as weekly sales and independent variable as fuel price,CPI,Unemployment

>Checking the normality of dependent variable and independent variable

>Checking the normality of Weekly_sales

>Checking the normality of CPI

>Checking the normality of unemployment

>Checking normality in Fuel_Price

>>Weekly_Sales is uniformly distrubuted but from0–1 is uncresing and then graduly decreasing.Fuelprice and unemployment is uniformly distributes.CPI is increasing from 120–140 and agin from 220–240.

>Visualizing the pairwise correlation:

>Finding the correlation the given data:

>Have a good correlation.

Building Regression Model

>Importing from statsmodels.formula.api package

>Here difference b/w R-squared and Adj R-squared value is less,but F-statistic is very low .Also P value of Fuel price is a bit greater.So Creating new model by ignoring Fuel price

OUTPUT:

>Here difference b/w R-squared and Adj R-squared value is 0. F-statistic is improved.Also P value is very less.So P value is less .Therefore Rejecting the null hypothesis.Hence CPI and unemployment and fuel price is affecting the sales.

>Checking f value,model parameter,confidence level and p-value

>Parameters estimated are considered to be significant if p-value is less than 0.05

This indicates intercept and cpi and unemployment are both significant parameters. And the parameter estimates can be accepted.

So, the linear model is Sales=B0+B1 CPI +B2 UNEMLOYMENT

>Evaluating model accuracy and predicting data:

>Calculating RMSE(root mean square error)

Conclusion :

Difference between Rsquared and Adj R squared value is less ,with improved F-statistic value

Also P-value is also which clearly rejects the N-hypothesis.

2.Change dates into days by creating new variable.

RESULT OF STATISTICAL MODELS:

1.Parameters estimated are considered to be significant if p-value is less than 0.05

This indicates intercept and cpi and unemployment are both significant parameters. And the parameter estimates can be accepted.

So, the linear model is Sales=B0+B1 CPI +B2 UNEMLOYMENT

2.Inserted one more variable Named Day in dataset.

contact detail: 7358265934,vivekvidya22@gmail.com

VIVEK VIDHYA

I ONLY HAVE ONE DRAWBACK,NOTHING CAN STOP ME WHEN I AM RESEARCHING SOMETHING.LIKE ONCE DECIDED ,SHOULD BE DONE.