
Walmart Sales Prediction Using Support Vector Regression and Multivariate Regression

Subhashis Hazarika
Abhishek Das
Deblin Bagchi

Abstract
Walmart offers millions of items for sale each day. It has several branches, with a number of
departments in each branch. In addition, Walmart runs several promotional sales (markdown
events) throughout the year, most of which are held just before prominent holidays such as the
Super Bowl, Labor Day, Thanksgiving, and Christmas.

We are trying to model the effects of certain influencing features on the overall weekly sales.
Understanding and being able to better predict the outcome of these sales is important to buyers
and sellers alike for individual profit maximization.

We want to explore the use of regression algorithms such as SVM and Multivariate Linear Regression
to predict the sales of a department on a particular day of the year.

Introduction and motivation
Walmart has dominated the retail store business for nearly two decades. While discount
sales (markdown events) influence the final sales of a department, there exists variability in sale
prices that may be attributed to factors such as Temperature, CPI and even Fuel Price, and possibly
to other input features as well. In this project, we aim to build a model that identifies and uses
important item features to predict the sales of a particular department.

Armed with this knowledge, sellers can optimize their listing strategies so as to maximize estimated
profits, while buyers can more easily identify good and bad purchases.

The goal in any data analysis is to extract an accurate relationship between raw information and
the quantity to be estimated. One way to do this is to employ regression to model that
relationship. There are various types of regression analysis. The type of regression model
depends on the distribution of the outcome (Y). By modeling, we try to predict Y based on the
values of a set of predictor variables (Xi). These methods allow us to assess the impact of multiple
variables (covariates and factors) in the same model.
One of the methods we focus on is multivariate linear regression. Linear regression is the procedure
that estimates the coefficients of the linear equation, involving one or more independent variables
that best predict the value of the dependent variable.
Support Vector Machines (SVMs) [3] are learning machines implementing the structural risk
minimization inductive principle to obtain good generalization on a limited number of learning
patterns. Structural risk minimization (SRM) involves a simultaneous attempt to minimize the
empirical risk and the VC (Vapnik-Chervonenkis) dimension. SVM implements a learning algorithm
useful for recognizing subtle patterns in complex data sets. The algorithm performs discriminative
classification, learning by example to predict the classifications of previously unseen data.

In this paper, we use SVM and Multivariate Regression to tackle the problem. In
the next section, we describe the dataset. Our Feature Engineering approaches and Experimental
Results are described in the two sections that follow. We end with a short conclusion.

Dataset:
The entire dataset was taken from the Kaggle [2] data challenge. It comprises data from 45
different Walmart stores, located in different places, each with 100 departments selling specific
classes of goods. We are given the weekly sales values of individual stores from 5th
February 2010 to 26th October 2012. We also have an extensive set of features for the stores:
the weekly values of Temperature, Fuel Price, Consumer Price Index and Unemployment, and
whether it is a holiday week. We also have prices for five Markdown events that Walmart hosts for
clearance sales or for special holidays.

However, we were not able to do much with the Markdown values when building our predictor, because
most of the Markdown values were missing and there was not much information on how they should
be interpreted.

Feature Engineering:
To get an insight into which features play a crucial role in determining weekly sales, we look
for strong relationships between each feature and the sales value.

a) Temperature: Temperature seems to have the strongest relation with sales values for
different departments across all the stores. To identify this, we performed an ordinary linear
regression on the sales data of individual departments.


Fig: The left plot shows how weekly sales vary with temperature for a selected department across
different stores. The right plot is a similar plot for the effect of Unemployment on sales.
b) Unemployment: The unemployment rate of a particular location seems to have the least impact
on the department sales data.

c) Fuel Price & CPI: These do seem to affect weekly sales, but their relation with sales is
weaker than that of the temperature feature above.

d) Season: Since temperature plays a crucial role we thought that by classifying different times
of the year into the four seasons based on the temperature value we can come up with a
reasonable new feature.
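
The season feature can be sketched as follows. The source does not state the exact rule the
authors used (and they worked in R), so this Python sketch assumes a simple month-based mapping
to the four meteorological seasons of the northern hemisphere; the names `SEASON_BY_MONTH`,
`season_of` and `one_hot_season` are hypothetical.

```python
from datetime import date

# Assumed mapping: meteorological seasons, northern hemisphere.
# The original report does not specify the rule that was actually used.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
SEASONS = ["winter", "spring", "summer", "autumn"]


def season_of(week_date):
    """Return the season label for the date of a weekly sales record."""
    return SEASON_BY_MONTH[week_date.month]


def one_hot_season(week_date):
    """Encode the season as four 0/1 indicators, in the order of SEASONS."""
    label = season_of(week_date)
    return [1 if label == name else 0 for name in SEASONS]


if __name__ == "__main__":
    first_week = date(2010, 2, 5)   # first week in the dataset
    print(season_of(first_week), one_hot_season(first_week))
```

A threshold on the weekly Temperature value would be an alternative way to derive the same
four-way label.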




Approach & Results:
We divided the data into three parts: a training set (all 2010 sales), a validation set (all 2011
sales) and a test set (the remaining 2012 sales). We use the standard Mean Absolute Percentage
Error (MAPE) [3] to judge our predictor model. We used R, a free software environment for
statistical computing and graphics, for our project. For the SVM implementation we used the
standard LibSVM library. The implementation and results of the experiments on our dataset are
given below.
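
A minimal sketch of the year-based split and of the MAPE metric is given here in Python rather
than R. The function names are hypothetical, MAPE is returned as a fraction on the assumption
that the values reported in this paper (such as 0.39) are on that scale, and skipping weeks whose
actual sales are zero is also an assumption, made only to avoid division by zero.

```python
from datetime import date


def split_by_year(rows):
    """Split (week_date, weekly_sales) pairs into train/validation/test by year."""
    name_for_year = {2010: "train", 2011: "validation", 2012: "test"}
    parts = {"train": [], "validation": [], "test": []}
    for week_date, sales in rows:
        parts[name_for_year[week_date.year]].append((week_date, sales))
    return parts


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.39 means 39%).

    Weeks with zero actual sales are skipped (an assumption of this sketch).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    rows = [(date(2010, 2, 5), 100.0),
            (date(2011, 6, 3), 120.0),
            (date(2012, 10, 26), 90.0)]
    parts = split_by_year(rows)
    print({name: len(part) for name, part in parts.items()})
    print(mape([100.0, 200.0], [90.0, 220.0]))
```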

SVM (SVR) [4]
Here we use the standard support vector machine algorithm for regression, which uses the
ε-insensitive loss function. This function tolerates errors that are not greater than ε.

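The standard optimization problem, written here in the ε-SVR primal form documented for LibSVM [1],
is:

$$
\min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)
$$

$$
\text{subject to} \quad
w^T \phi(x_i) + b - y_i \le \epsilon + \xi_i, \quad
y_i - w^T \phi(x_i) - b \le \epsilon + \xi_i^*, \quad
\xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, n
$$

where x_i is the feature vector of the i-th training example, y_i is its sales value, φ is the
kernel feature map, and ξ_i, ξ_i^* are slack variables for errors larger than ε.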

We solve the constrained optimization problem of SVM training using the R language module for
LibSVM [1]. We have to provide suitable values for the following two parameters.

i) ε - regression error bound/tolerance.

ii) C - parameter that weighs how strongly regression errors exceeding ε are penalized.
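
The roles of the two parameters can be illustrated with a small Python sketch for a linear model.
It only evaluates the ε-insensitive loss and the regularized objective in its unconstrained form;
it does not train a model and is not LibSVM's implementation. The function names are
hypothetical.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the tolerance tube of width eps, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def linear_svr_objective(w, b, X, y, C=1.0, eps=0.1):
    """0.5 * ||w||^2 + C * sum of eps-insensitive losses for a linear model."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, xi)) + b
        penalty += eps_insensitive_loss(yi, prediction, eps)
    return regularizer + C * penalty


if __name__ == "__main__":
    X = [[1.0], [2.0]]
    y = [2.0, 5.0]
    # A larger C makes the same errors outside the tube more costly.
    for C in (1.0, 3.0):
        print(C, linear_svr_objective([2.0], 0.0, X, y, C=C, eps=0.1))
```

Errors inside the tube contribute nothing, so ε controls how many training points influence the
fit, while C trades off a small weight vector against errors outside the tube.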

We trained our SVM with different values of ε and C. LibSVM has a default ε of 0.1. We varied
the value of C and found that a value of 3 with a linear kernel gives a MAPE of 0.39 on the
validation set and 0.41 on the test set.

Multivariate Regression
One of the most common approaches to the sales prediction problem is multivariate
regression. We first look at each feature separately and then at the relationships among them.
The general regression model is represented as follows:

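$$
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon
$$

where β_0 is the intercept, β_1, ..., β_k are the coefficients of the k features x_1, ..., x_k,
and ε is the error term.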


We look at the distribution of each feature to determine its importance in the regression analysis.
The feature vector included the temperature, unemployment rate, fuel price, CPI and the newly
constructed season feature. Based on the time of the year, we created another parameter, season
(spring, summer, autumn or winter), because we felt that, apart from temperature, the seasonal
sales of a particular department might be a crucial factor. The regression gives us an error rate
of 0.36.
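
As an illustration of how such a model can be fitted, the following Python sketch estimates the
coefficients by ordinary least squares using the normal equations. It is not the authors' R code;
the function names and the toy data are hypothetical, and only one season indicator is used to
keep the example small.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y.

    A column of ones is prepended, so beta[0] is the intercept.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; drop one column")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(beta, x):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_k * x_k."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


if __name__ == "__main__":
    # Toy rows of [temperature, is_summer]; sales built as 10 + 2*temp + 5*is_summer.
    X = [[30, 0], [50, 0], [70, 1], [90, 1], [60, 0]]
    y = [10 + 2 * t + 5 * s for t, s in X]
    beta = fit_linear_regression(X, y)
    print([round(b, 6) for b in beta])
    print(predict(beta, [80, 1]))
```

When the season is one-hot encoded together with an intercept, one of the four indicators has to
be dropped; otherwise the columns are collinear and the normal equations have no unique solution.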

Conclusion:
We would likely have obtained better results had we implemented a robust feature selection
criterion. Filter methods, wrapper methods, forward selection and backward elimination [4] are some
of the most common approaches, but due to time constraints we kept to a more ad hoc approach of
examining the individual effect of each feature on sales and deciding on that basis which features
to take into consideration. However, as part of this project we have successfully implemented SVM
for the sales prediction problem, which is generally not the go-to approach for such problems.

References:
[1] C.-C. Chang, C.-J. Lin: LibSVM: a library for support vector machines. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

[2] Kaggle Walmart competition:
http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

[3] Gianni Di Pillo et al.: An application of learning machines to sales forecasting under promotion.

[4] Jose Guajardo et al.: A Forecasting Methodology using Support Vector Regression and Dynamic
Feature Selection.
