Drugstores Retail Sales Forecasting
BMAN60422 Data Analytics for Business Decision Making
MSc Business Analytics 2019/2020
Group 4
Overview & Objective
Data were collected for 1,115 health & beauty stores across Germany over 31 months, covering each store’s profile, daily sales and customers, promotions, competition and holidays during the period.
Our team was tasked with building a reliable model to forecast sales six weeks ahead, to help managers increase productivity, profitability and customer satisfaction while meeting demand.
Methodology
1. Set Objectives: identify problems & define objective
2. Selecting, Identifying and Integrating Data
3. Data Preprocessing
● Cleaning, enriching, transforming the data
● Feature engineering, selection & data partition
4. Exploring, Visualising and Analysing Data
5. Building, Validating and Comparing Models
6. Forecasting & Assessing Result
Data Identification & Integration
Entity Relationship Diagram: Store.csv is combined with Train.csv by an inner join on Store ID.

Store.csv
Column                       Type
Store ID                     (join key)
StoreType                    Nominal
Assortment                   Nominal
CompetitionDistance          Ratio
CompetitionOpenSinceMonth    Nominal
CompetitionOpenSinceYear     Ordinal
Promo2                       Nominal
Promo2SinceWeek              Nominal
Promo2SinceYear              Ordinal
PromoInterval                Nominal

Train.csv
Column                       Type
Store ID                     (join key)
DayOfWeek                    Nominal
Date                         Interval
Sales                        Ratio
Customers                    Ratio
Open                         Nominal
Promo                        Nominal
StateHoliday                 Nominal
SchoolHoliday                Nominal
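The inner join described above can be sketched in pandas. This is only a minimal illustration under assumptions: the key column is assumed to be named `Store` in both tables (the slides call it Store ID), the function name is hypothetical, and the toy frames stand in for Store.csv and Train.csv.

```python
import pandas as pd


def join_store_and_train(train: pd.DataFrame, store: pd.DataFrame) -> pd.DataFrame:
    """Inner-join daily sales rows with store attributes on the store key."""
    # validate="many_to_one" raises an error if a store appears more than
    # once in the store table, which would silently duplicate sales rows.
    return train.merge(store, on="Store", how="inner", validate="many_to_one")


# Toy stand-ins for Train.csv and Store.csv (values are made up).
train = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Date": pd.to_datetime(["2015-07-30", "2015-07-31", "2015-07-31", "2015-07-31"]),
    "Sales": [5263, 5020, 6064, 8314],
})
store = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "b"]})

merged = join_store_and_train(train, store)
print(merged)
```

With an inner join, the row for store 3 is dropped because it has no match in the store table.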
Handling Missing Data (Values & Rows)
(180 stores missing 184 dates of data) + (1 store missing 1 date of data) = 180 × 184 + 1 = 33,121 missing rows
New rows were created to complete the dataset, with each variable filled as follows:
● DayOfWeek, StateHoliday, SchoolHoliday and Promo: follow the pattern / schedule of the available stores on the same date
● Open: classify stores by their opening pattern
● Sales & Customers: predicted with Random Forest
Missing values in individual columns were replaced, per the figure labels, by: the mean; the median; the median; 1; 2016; 0.
Missing values related to Promo2 occur only when Promo2 = 0, which means the store does not run the mailing promotion. We therefore set PromoInterval to 0 and filled the Promo2Since fields to indicate that the promotion has not taken place.
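One way to create the missing rows is to reindex the data onto a complete store-by-date grid. The sketch below is an assumption about how this could be done, not the team's actual code: the column names `Store`, `Date` and `Sales` and the function name are hypothetical, and only DayOfWeek is filled here (1 = Monday), leaving the other columns as missing for later imputation.

```python
import pandas as pd


def add_missing_rows(train: pd.DataFrame) -> pd.DataFrame:
    """Reindex onto every (Store, Date) pair so that gaps become explicit rows."""
    all_dates = pd.date_range(train["Date"].min(), train["Date"].max(), freq="D")
    full_index = pd.MultiIndex.from_product(
        [sorted(train["Store"].unique()), all_dates], names=["Store", "Date"]
    )
    completed = train.set_index(["Store", "Date"]).reindex(full_index).reset_index()
    # DayOfWeek depends only on the date, so it can be filled directly.
    completed["DayOfWeek"] = completed["Date"].dt.dayofweek + 1
    return completed


# Sanity check of the row count quoted on the slide.
assert 180 * 184 + 1 == 33121

# Toy data: store 2 has no row for 2013-01-02.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-01", "2013-01-03"]
    ),
    "Sales": [100.0, 110.0, 120.0, 200.0, 220.0],
})
completed = add_missing_rows(toy)
print(completed)
```

The new row for store 2 has a missing Sales value, which is where a model-based prediction such as the Random Forest mentioned above would be applied.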
Variable Reduction & Creation
1. Dropped ‘Customers’, for these reasons:
● Both Sales and Customers are dependent variables whose future values are unknown at forecast time.
● There is a very clear positive relationship between Customers and Sales (correlation = 0.89). Customers therefore leaks information about the target, and variables that leak the target value directly should not be used in the analysis.
2. Dropped ‘Open’, for this reason:
● Open shows quasi-complete separation: no Sales occur when a store is closed. We used Open to filter the training, validation and test sets, as we only want to forecast Sales when a store is open.
3. Dropped variables that can be replaced by other variables:
● ‘CompetitionOpenSinceMonth’, ‘CompetitionOpenSinceDay’ & ‘CompetitionOpenSinceYear’ are dropped and replaced with ‘CompetitionAge’ and ‘AffectedByCompetition’.
● ‘Promo2SinceWeek’, ‘Promo2SinceDay’ & ‘Promo2SinceYear’ are dropped and replaced with ‘AffectedByPromo2’.
4. Created variables ‘OpeningType’ (based on each store’s opening pattern), ‘Month’, ‘DateOfMonth’ & ‘WeekOfYear’ to enrich the data.
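The derived variables can be illustrated with a short pandas sketch. The slides do not define how CompetitionAge and AffectedByCompetition were computed, so the definitions here are assumptions: the competitor is taken to open on the first day of its opening month, CompetitionAge is the number of days since then (floored at 0), and AffectedByCompetition is 1 once the competitor has opened. The function name is hypothetical.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar features and competition features (assumed definitions)."""
    out = df.copy()
    out["Month"] = out["Date"].dt.month
    out["DateOfMonth"] = out["Date"].dt.day
    out["WeekOfYear"] = out["Date"].dt.isocalendar().week.astype(int)

    # Assumption: the competitor opened on day 1 of the recorded month.
    opened = pd.to_datetime(pd.DataFrame({
        "year": out["CompetitionOpenSinceYear"],
        "month": out["CompetitionOpenSinceMonth"],
        "day": 1,
    }))
    age_days = (out["Date"] - opened).dt.days
    out["AffectedByCompetition"] = (age_days >= 0).astype(int)
    out["CompetitionAge"] = age_days.clip(lower=0)
    # The original columns are dropped once the replacements exist.
    return out.drop(columns=["CompetitionOpenSinceMonth", "CompetitionOpenSinceYear"])


toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-31", "2013-01-01"]),
    "CompetitionOpenSinceYear": [2015, 2014],
    "CompetitionOpenSinceMonth": [7, 1],
})
features = add_features(toy)
print(features)
```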
Data Transformation
Ratio variables (CompetitionAge, CompetitionDistance) are transformed to reduce the effect of outliers and skewness.
Transformations compared (resulting skew):
- Log (-0.36)
- Sqrt (1.16)
- Box-Cox (-0.03)
Best skew reduction: Box-Cox
Plot labels: Transforming: Square Root; Skew: 2.93; Skew: 9.4; Skew: 0.48
Nominal variables are encoded using LabelEncoder so that they are treated as categorical or boolean variables as appropriate.
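The comparison of transformations by skew can be reproduced along the following lines. This is a sketch on synthetic data, not the project's data: the log-normal sample and the function name are assumptions, and `scipy.stats.boxcox` requires strictly positive inputs, so any zeros would need shifting first.

```python
import numpy as np
from scipy import stats


def compare_skew(values) -> dict:
    """Return the sample skewness of the raw and transformed values."""
    x = np.asarray(values, dtype=float)
    # boxcox fits the power parameter by maximum likelihood.
    boxcox_values, _fitted_lambda = stats.boxcox(x)
    return {
        "raw": float(stats.skew(x)),
        "log": float(stats.skew(np.log(x))),
        "sqrt": float(stats.skew(np.sqrt(x))),
        "boxcox": float(stats.skew(boxcox_values)),
    }


# Synthetic right-skewed, strictly positive sample (stand-in for a distance column).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=7.0, sigma=1.0, size=2000)

skews = compare_skew(sample)
best = min(skews, key=lambda name: abs(skews[name]))
print(skews, "-> smallest absolute skew:", best)
```

Choosing the transformation with the smallest absolute skew mirrors the selection rule stated on the slide.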
Sales vs. Categories
Major takeaways:
● Plot (1) shows that StoreType ‘b’ has the highest sales, given the stores are open.
● Plot (2) shows that assortment level ‘extra’ (b) has the highest sales, given the stores are open.
● Plot (3) reiterates that StoreType = b has the highest sales.
● Plot (3) suggests that the combination of StoreType = b and Assortment = c has the highest sales. This differs from the result when the two variables are assessed separately (see the second takeaway).
(1) Sales by StoreType (Open = 1): sales of each store type when stores are open.
(2) Sales by Assortment (Open = 1): sales by assortment level when stores are open.
(3) Sales by Categories over StateHoliday: sales of each combination of StoreType and Assortment, by StateHoliday, when stores are open.
Plot axes: StoreType (a, b, c, d), Assortment (a, b, c).
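The category comparison behind these plots amounts to a grouped average over open days. A minimal pandas sketch follows, assuming the merged data has columns named `Open`, `StoreType`, `Assortment` and `Sales`; the function name and the toy values are illustrative only.

```python
import pandas as pd


def mean_sales_by_category(df: pd.DataFrame) -> pd.Series:
    """Average sales per (StoreType, Assortment) pair, open days only."""
    open_days = df[df["Open"] == 1]
    return (
        open_days.groupby(["StoreType", "Assortment"])["Sales"]
        .mean()
        .sort_values(ascending=False)
    )


# Made-up rows; the closed day (Open = 0) is excluded from the averages.
toy = pd.DataFrame({
    "StoreType": ["a", "a", "b", "b", "b"],
    "Assortment": ["a", "a", "c", "c", "b"],
    "Open": [1, 0, 1, 1, 1],
    "Sales": [5000, 0, 9000, 11000, 8000],
})
ranking = mean_sales_by_category(toy)
print(ranking)
```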
Sales vs. Holiday Variables
(1) Boxplot of Sales by StateHoliday: Easter has the most variability in sales; however, Easter, public holidays and Christmas (StateHoliday = a, b, c) show similar trends.
(2) Boxplot of Sales by SchoolHoliday: the trend is similar whether a store is affected by a school holiday (SchoolHoliday = 1) or not (SchoolHoliday = 0).
(3) Sales by StateHoliday: overall sales without holidays are higher, since stores are open more on those days.
(4) Sales by StateHoliday (Open = 1): sales were higher on Easter and Christmas than on a public holiday if the store was open.
Major takeaways:
● There is variability in sales during the holidays. There is more variability when there is no school holiday.
● The number of StateHolidays affects the average sales.
● Easter and Christmas have higher average sales than a public holiday, given the store is open.
Sales vs. DayOfWeek
(1) Sales by DayOfWeek (All): sales are lower on Sundays, since this is typically when stores are closed. Monday typically has higher sales than other days of the week.
(2) Sales by DayOfWeek (Open = 1): sales by day of the week when stores are open, including Sundays. Mondays and Sundays typically have higher sales than other days.
Sales vs. Promo
(3) Sales by Promo (Open = 1): sales by whether the store is running a store-specific promotion that day. The graph indicates higher average sales when a promotion is in place.
(4) Boxplot of Sales by Promo: shows similar behaviour whether Promo = 0 (no in-store promo) or Promo = 1 (in-store promo).
(5) Boxplot of Sales by Promo2: describes the sales distribution by whether a store runs the mailing promotion and, if so, by the type of mailing interval. The bar plot of average sales shows a similar pattern. Interestingly, this graph suggests that the mailing campaign has a negative impact on sales.
Sales vs. CompetitionDistance
● The key indicates a binary variable, AffectedByCompetition, where 0 = not affected and 1 = affected.
● There is no noteworthy trend between distance and whether a store is affected by the competition.
Sales vs. CompetitionAge (Days)
● Our analysis shows that the older the competitor’s opening date, the lower its effect on sales.
● If the competitor is new, there is a higher impact on the store’s sales (lower sales).
Sales vs. Other Variables
(1) Sales versus Week by Year: displays a general oscillating or cyclical trend until about week 29 for 2013, 2014 and 2015. There is also an increase from week 45 onwards for 2013 and 2014.
(2) Sales versus Month by Year (Open = 1): highlights an increase in sales from October to December in 2013 and 2014, which suggests seasonality. There was also a fall in sales in May 2015.
Sales over Time
The following plots display (1) the average sales of all stores over time, (2) the average sales of all stores over time when Open = 1, and (3) the average sales of each category*.
*Categories are the combinations of assortments and store types.
Major takeaways:
● All plots display the seasonality of the sales data.
● Although average sales levels vary in plot (3), the trend is similar across all categories.
● Category 5 has the highest sales.
Model Building & Comparison
Data Partition
The data was split in two ways:
1. Split Random: 80% training, 20% validation, assigned at random
2. Split by Date (Out-Of-Sample):
a. Training set: 1 Jan 2013 - 13 Jun 2015.
b. Validation set: 14 Jun 2015 - 31 Jul 2015.
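The two partitioning schemes can be sketched as follows. This is an illustration under assumptions: the data is taken to have a datetime column named `Date`, the function names and the random seed are hypothetical, and the cutoff is the first validation day given above.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, first_validation_day: str = "2015-06-14"):
    """Out-of-sample split: train strictly before the cutoff, validate from it on."""
    cutoff = pd.Timestamp(first_validation_day)
    return df[df["Date"] < cutoff], df[df["Date"] >= cutoff]


def split_random(df: pd.DataFrame, train_fraction: float = 0.8, seed: int = 0):
    """Random split: sample the training rows, keep the rest for validation."""
    train = df.sample(frac=train_fraction, random_state=seed)
    return train, df.drop(train.index)


# Toy daily data for June and July 2015 (values are made up).
days = pd.date_range("2015-06-01", "2015-07-31", freq="D")
toy = pd.DataFrame({"Date": days, "Sales": range(len(days))})

train_d, valid_d = split_by_date(toy)
train_r, valid_r = split_random(toy)
print(len(train_d), len(valid_d), len(train_r), len(valid_r))
```

The date-based split keeps every validation row later than every training row, which mimics forecasting future weeks; the random split mixes dates and can therefore give more optimistic errors on time-dependent data.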
Validation RMSE by model

Model type       Model / parameters                          Split by date (6-week validation)   Split random
Regression       Linear                                      2616                                2615
                 Polynomial (degree=3)                       -                                   2412
                 Decision Tree                               -                                   2574
                 Random Forest                               1111                                940
Neural Network   MLP                                         4077                                2734
                 LSTM Univariate (10 stores, Epoch=200)      1562*                               N/A
                 LSTM Multivariate (10 stores)               1192*                               N/A
Time Series      Auto ARIMA (model trained on each store)    Mean = 1808, Median = 1716          N/A
                 SARIMA (model trained on each store)        Mean = 2273, Median = 2164          N/A

*Model built on a sample of stores because computation time was too long.
Random Forest regression is the best-performing model, with the lowest RMSE across all stores.
*Sample plot of model fit with Store 351
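The validation metric in the table is the root mean squared error, the square root of the average squared difference between actual and predicted sales. The sketch below shows how a Random Forest could be fitted and scored this way with scikit-learn. It is not the team's configuration: the synthetic features, the number of trees, the seed and the function name are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def fit_and_score(X_train, y_train, X_valid, y_valid, seed: int = 0):
    """Fit a Random Forest and return it with its validation RMSE."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    rmse = float(np.sqrt(mean_squared_error(y_valid, predictions)))
    return model, rmse


# Synthetic stand-in for the engineered features and daily sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 5000 + 3000 * X[:, 0] + 1000 * X[:, 1] * X[:, 2] + rng.normal(0, 100, size=500)

model, rmse = fit_and_score(X[:400], y[:400], X[400:], y[400:])

# Naive baseline: always predict the training mean.
baseline_rmse = float(np.sqrt(np.mean((y[400:] - y[:400].mean()) ** 2)))
print(f"Random Forest RMSE: {rmse:.0f}, baseline RMSE: {baseline_rmse:.0f}")
```

Comparing against a naive baseline gives the RMSE values a reference point, since RMSE is expressed in the same units as Sales.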
Model Chosen: Random Forest
● Random Forest has the best fit for all stores, with an RMSE of 1111 for the 6-week forecast and 940 with the random split.
● It provides better results because many weakly correlated trees act as one large committee, so the combined forecast typically outperforms the individual trees.
● The model reduces overfitting by combining the results of all the decision trees, and it performs well even when more values are missing.
● Random Forest has less variance than the other models.
*Sample plot of model fit with Store 849
Conclusion
● The best forecasting method for the 6-week data is the Random Forest Regression algorithm, which had an RMSE of 1111.
● More advanced machine learning algorithms such as LSTM have the potential to provide better forecasts; however, they require longer computation time.
Assumptions & Limitations
● The residuals of our Random Forest (RF) fit are spread fairly randomly around the 0 line, which suggests that the error variance is roughly constant.
● RF uses bootstrap aggregation, which relies on the common assumption that the sampling is representative.
● Forecasting using only historical data assumes that future performance will be similar and does not account for market changes.
● Our models are reliable for a 6-week prediction horizon, which is sufficient and reasonable for predicting the final test set.
* Sample plot of model fit with Store 351
Recommendations
● Enrich the data with more information such as pricing, product variety, new product launches and location.
● Consider re-evaluating the promotional strategy (e.g. Promo2 appears to have a negative impact on sales).
Thank you.
