Predict Store Format and Produce Sales Forecast

Kai SS
Jul 21, 2020
4 min read

Updated: Jul 31, 2020

#predictive_analytics #segmentation #k_means_clustering #alteryx #tableau

Project Background

The client’s company has 85 grocery stores and is planning to open 10 new stores at the beginning of the year. When I received the task, all stores were using the same store format for selling their products. The company has treated all stores similarly, shipping the same amount of product to each store. This has begun to cause problems as stores are suffering from product surpluses in some product categories and shortages in others. I have been asked to provide analytical support to make decisions about store formats and inventory planning.

Research Objectives

Task 1: Determining Store Format

To remedy the product surplus and shortages, the company wanted to introduce different store formats. Each store format would have a different product selection in order to better match local demand. I was asked to:

Determine the optimal number of store formats based on sales data.
Use a K-means clustering model.
Segment the 85 current stores into the different store formats.

Task 2: Store Format for New Stores

The grocery store chain planned to have 10 new stores opening up. The company wanted to determine which store format each of the new stores should have. However, we didn’t have sales data for these new stores yet, so we had to determine the format using each of the new store’s demographic data. I was asked to:

Develop a model that predicts which segment a store falls into based on the demographic and socioeconomic characteristics of the population that resides in the area around each new store.
Use a 20% validation sample with Random Seed = 3 when creating samples with which to compare the accuracy of the models. Make sure to compare a decision tree, forest, and boosted model.
Use the model to predict the best store format for each of the 10 new stores.

Task 3: Forecasting Sales

Fresh produce has a short life span, and due to increasing costs, the company wants to have an accurate monthly sales forecast.

I was asked to prepare a monthly forecast for produce sales for the full year of 2016 for both existing and new stores.

The raw data before analysis please see below:

Solutions

I used Alteryx for all the data analytics in this project.

Task 1: Determining Store Format

Q1. What is the optimal number of store formats?

A: The optimal number of store formats is 3.

I ran K-Means clustering model and used the median and spread of the Adjusted Rand Indices and Calinski-Harabasz Indices. The optimal number of store formats is 3 when both the indices registered the highest median value.

Q2: How many stores fall into each store format?

A: Cluster 1 has 23 stores, cluster 2 has 29 stores while cluster 3 has 33 stores.

Q3. Based on the results of the clustering model, what is one way that the clusters differ from one another?

A: Cluster 1 stores sold more General Merchandise, while Cluster 2 stores sold more Produce, and Cluster 3 stores sold more deli and meat.

Q4. Please provide a Tableau visualization that shows the location of the stores.

A: Below is a Tableau visualization of the stores, I used color to show cluster, and size to show total sales.

Task 2: Store Format for New Stores

Q1: What methodology did you use to predict the best store format for the new stores? Why did you choose that methodology?

A: I chose Boosted Model to predict the best store format for the new stores. I used the Model Comparison tool to compare the Decision Tree, Forest Model and Boosted Model. Both Forest Model and Boosted Model have better accuracy than the Decision Tree. Boosted Model has higher F1 score, so I chose Boosted Model.

Task 3: Forecasting Sales

Q1. What type of ETS or ARIMA model did you use for each forecast? How did you come to that decision?

A: I chose ETS (M,N,M) to forecast the produce sales. I compared the ETS and ARIMA model on Alteryx by taking the following steps:

I used storesalesdata file as the data source. I added a Summarize tool to sum the produce sales, grouped by year and month.
For comparison, I filtered 6 months’ records as a holdout sample.
Then I trained ETS and ARIMA models. The optimal option for ETS model is ETS(M,N,M) and for ARIMA is ARIMA(1,0,0)(1,1,0)[12].
Then I added TS Compare tool to obtain the forecast error measurements against the holdout sample for each model.
Compare the forecast error measurements against the holdout sample of ETS and ARIMA and select the model with the lower forecast error measurements. ETS(M,N,M) turns out to be the better one.

ETS(M,N,M):

ARIMA(1,0,0)(1,1,0)[12]:

ETS model’s accuracy is higher when compared to ARIMA model, and has lower in-sample error measures as well.

We can see in the decomposition plot above there is no trend, seasonal is multiplicative and error is multiplicative. After comparing the results against the holdout sample, the ETS performs better against the ARIMA model.

Q2. Please provide a table of your forecasts for existing and new stores. Also, provide visualization of your forecasts that includes historical data, existing stores forecasts, and new stores forecasts.

A: For existing stores: