
George Han

04/10/13
Regression and Multivariate Data Analysis STAT-UB 17
Homework 4
Professor Simonoff
Patterns in Time Ordered Data of Select Retail and Food Services
Every month in the U.S., billions of dollars are spent shopping for groceries. Similar
amounts of money are also spent shopping for other goods such as those pertaining to sports,
hobbies, books, music, furniture, and other miscellaneous categories. Are sales from one industry
associated with those from others?
This report will use statistical methods to test for associations between grocery sales and
other industries' sales. These associations are interesting because they may reveal insights
into the forces that, on some level, intertwine different industries. The target variable
will be monthly sales from grocery stores in millions of dollars (henceforth grocery sales).
Potential predicting variables will be:

Monthly sales from sporting goods, hobby, book, and music stores, in millions of dollars
(henceforth entertainment sales).
Monthly sales from furniture and home furnishings stores, in millions of dollars
(henceforth furniture sales).
Monthly sales from department stores, excluding leased departments, in millions of
dollars (henceforth department store sales).

These time-ordered data were found on the website of the United States Census Bureau
(links at the end of this report), and have been adjusted for seasonality prior to online release*.
Therefore, the scope of the data is limited to sales from within the U.S. The time range of the
data extends from September 2010 (time = 1) to February 2013 (time = 30). An increment of 1
unit in the time variable corresponds to an increment of one month in real world time (henceforth
simply time). Consider the following three scatterplots of the target variable vs. each
predicting variable, and the following four time series plots of the target variable and each
predicting variable:

Grocery sales appear to be roughly positively correlated with entertainment sales and furniture
sales, and negatively correlated with department store sales, though that association seems weaker
and may exhibit both nonlinearity and heteroscedasticity. The time series plot of entertainment
sales appears to be roughly linear. However, the other three time series plots display some degree
of potential nonlinearity. This means that it may be appropriate to apply transformational
techniques to the data, but first consider the following regression:

The least squares multiple regression line is:

Grocery Sales = 2.3207 x 10^4 + 1.9954 * Entertainment Sales + 1.5505 * Furniture Sales - 2.0650 x 10^-1 * Department Store Sales
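
The statistical output itself is not reproduced here, but as a rough illustration, a regression like this could be fit in Python with statsmodels. This is a minimal sketch: the file name and the column names (grocery, entertainment, furniture, dept) are assumptions, since the report does not show its data layout or software.

```python
# A minimal sketch of fitting the three-predictor regression by OLS.
import pandas as pd
import statsmodels.api as sm

sales = pd.read_csv("retail_sales.csv")   # hypothetical file name
y = sales["grocery"]
X = sm.add_constant(sales[["entertainment", "furniture", "dept"]])

fit = sm.OLS(y, X).fit()
print(fit.summary())   # coefficients, R^2, residual SE, t- and F-statistics
```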
The intercept coefficient is useless to interpret because it is nonsensical in the context of
the data: net entertainment sales, furniture sales, and department store sales are highly unlikely to
equal 0 in any month in any year.
Holding all other predictors constant:

An increase in monthly entertainment sales of $1 million is associated with an expected
estimated increase in monthly grocery sales of $1.9954 million, or approximately
$1,995,400.
An increase in monthly furniture sales of $1 million is associated with an expected
estimated increase in monthly grocery sales of $1.5505 million, or approximately
$1,550,500.
An increase in monthly department store sales of $1 million is associated with an
expected estimated decrease in monthly grocery sales of $2.0650 x 10^-1 million, or
approximately $206,500.

The regression is of good strength with R2 = 0.819, meaning that 81.9% of the variability
in the target variable can be accounted for by the predicting variables. The residual standard error
is 571.6, representing the standard deviation of the points formed around the regression line. This
means that roughly 95% of actual target values should fall within +/- 2 * residual standard error,
or +/- 1143.2, of the predicted target values. The p-value of the t-statistic for each regression
coefficient gives the probability of observing a coefficient estimate at least that extreme if the
true coefficient were 0. These p-values correspond to the following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with grocery sales,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with grocery sales, holding all other predictors equal)
In this report, regression coefficients will be considered statistically significant if their
p-values are less than 0.05, the α = 0.05 level of significance. It appears that only for
entertainment sales (p-value = 0.0406) can we reject the null hypothesis. For furniture sales
(p-value = 0.0810) and department store sales (p-value = 0.7648), we fail to reject the null
hypothesis. However, the p-value for furniture sales, 0.0810, is very close to 0.05, so this is
only a marginal failure to reject. This means that entertainment sales has a statistically
significant association with grocery sales, furniture sales may potentially have some association
with grocery sales, and there is no evidence that department store sales is associated with
grocery sales. The p-value of the F-statistic, less than 0.001, is significant at α = 0.05, meaning
that the model as a whole is statistically significant. It corresponds to the following hypothesis
test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis, meaning that the model as a whole has some predictive
capability.

The variance inflation factors do not indicate any problematic multicollinearity
between predictors, because they are all below 10. Multicollinearity will be analyzed in greater
detail below.
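
As a hedged sketch of how those VIFs could be computed, again assuming the hypothetical file and column names from the earlier sketch:

```python
# A minimal VIF computation with statsmodels; VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors.
# Values above 10 are commonly flagged as problematic.
from statsmodels.stats.outliers_influence import variance_inflation_factor

for i, name in enumerate(X.columns):    # X from the regression sketch above
    if name != "const":                 # the intercept's VIF is not meaningful
        print(name, variance_inflation_factor(X.values, i))
```

Consider the following standardized residual plots: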

The normal probability plot implies that the data are fairly normally distributed, but also
implies that the data have a slightly short left tail and a slightly fat right tail. This slight deviation
from normality could mean that estimates of regression coefficients may be slightly
inappropriate, because a part of the signal may be mistakenly treated as noise. The residuals
vs. fitted values plot may indicate some potential non-linearity as evident in the standardized
residuals generally first being below 0, then above 0, and then below 0 again. It may also
indicate some potential heteroscedasticity, as evident in the roughly hourglass-shaped, wide-skinny-wide fanning of the standardized residuals as fitted values increase. This may mean that
estimates of regression coefficients may be less accurate, and also that predictive accuracy may
be incorrect. The histogram is approximately normally distributed, though it may be slightly
left-skewed. The residuals vs. the order of the data indicate some potential positive
autocorrelation, because there are relatively few runs and nearby standardized residuals appear
correlated with one another. Consider the following standardized residual plots vs. each
predictor:

The standardized residual plots vs. entertainment sales and vs. furniture sales appear to
display patterns similar to those in the standardized residuals vs. fitted values plot above. The
standardized residuals vs. department store sales plot may exhibit a little heteroscedasticity as
evident in the slight outward fanning, and may have a suspicious point in the lower right hand
corner. There do not appear to be any really obvious extreme values, but regardless, consider the
following series of diagnostic plots to assess the magnitude of any possible outliers or leverage
points.

The topmost plot shows the Cook's distances for each observation. These measure how
much an observation influences the fitted regression coefficients. Any observation with a Cook's
distance above 1 should be studied further, and here, there are none.
The second plot from the top shows the standardized residuals for each observation.
These measure how far an observation lies from where the fitted regression implies it should be.
Any observation with a standardized residual beyond +/- 2.5 should be studied further, because
such a value would occur by pure chance only about 1% of the time; here, that observation is
observation 1 (standardized residual = -2.617).
The bottommost plot shows the hat values (h_i) for each observation. These measure,
based on the predictor values alone, how far a case lies from the rest of the cases in predictor
space, indicating leverage. Any observation with a hat value of 2.5((p + 1)/n) or greater should
be studied further, where p is the number of predicting variables in the regression (3) and n is
the total number of observations (30); for this regression the cutoff is 2.5((3 + 1)/30) = 0.3333.
Here, that observation is observation 16 (h_i = 0.339).
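
As a sketch of how these three diagnostics could be computed together, reusing the hypothetical `fit` object from the first code sketch:

```python
# A minimal sketch of the influence diagnostics discussed above:
# standardized residuals, hat (leverage) values, and Cook's distances.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(fit)
std_resid = infl.resid_studentized_internal   # standardized residuals
hat = infl.hat_matrix_diag                    # hat values
cooks = infl.cooks_distance[0]                # Cook's distances

p, n = 3, 30                                  # predictors, observations
hat_cutoff = 2.5 * (p + 1) / n                # = 0.3333 for this model
flagged = (np.abs(std_resid) > 2.5) | (hat > hat_cutoff) | (cooks > 1)
print(np.flatnonzero(flagged) + 1)            # 1-indexed observation numbers
```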
A table with specific values for diagnostic tests for potential extreme values (containing
observation numbers, standardized residual values, hat values, and Cook's distances, from left to
right):

It appears that the main observations of interest regarding extreme values are observations
1 and 16 (tabled below). Observation 1 is a statistical outlier due to its statistically significant
standardized residual. Observation 16 is a statistical leverage point due to its statistically
significant h_i. Observation 1 corresponds to September 2010. Observation 16 corresponds to
December 2011. There do not appear to be any good explanations for why these observations
should be deemed extreme, because it does not appear that any significant real-world contextual
events occurring in these two months could have had a major impact on grocery sales,
entertainment sales, furniture sales, department store sales, or time. In these two months, the
only major significant event involving the U.S. was that on December 15, 2011, the U.S.
formally declared an end to the Iraq War, but it is very unclear how this could have impacted
the predictors listed above. In conclusion, since observations 1 and 16 are both difficult to
explain as logical extreme values, each has only one of the three extreme value diagnostics
flagged as statistically significant, and the flagged values are only barely beyond the required
thresholds (-2.617 barely < -2.5 and 0.339 barely > 0.3333), no action will be taken.
Observation (Also Time)   Date      Grocery Sales   Entertainment Sales   Furniture Sales   Department Store Sales
1                         2010-09   43,604          6,745                 7,284             15,334
16                        2011-12   46,334          6,942                 7,684             15,364

Returning to the topic of autocorrelation, there may be some in the data, as indicated by
the residuals vs. order of the data plot above. Reproducing this plot:

Carrying out a parametric test for autocorrelation, consider the Durbin-Watson statistic:

The Durbin-Watson statistic is 0.5018, with a p-value of less than 0.001, which is
significant at α = 0.05. This corresponds to the following hypothesis test:
H0: errors are independent and normally distributed with expected mean 0 and constant variance
(N(0, σ²)), and can be modeled with an AR(1) model with ρ = 0 (there is no statistically
significant autocorrelation)
HA: errors can be modeled with an AR(1) model with ρ ≠ 0, i.e. ε_i = ρ ε_(i-1) + z_i,
corr(ε_i, ε_(i-r)) = ρ^r (there is indeed statistically significant autocorrelation).
Since the p-value of the Durbin-Watson statistic is very statistically significant, we
strongly reject the null hypothesis and state that there may indeed be statistically significant
autocorrelation. Since the Durbin-Watson statistic is 0.5018 which is closer to 0 than it is to 4 (0
= maximum positive autocorrelation, 2 = no autocorrelation, 4 = maximum negative
autocorrelation), the type of the autocorrelation is positive.
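
As a sketch, the statistic itself could be computed from the residuals of the hypothetical `fit` object above. Note that statsmodels reports only the statistic; the p-value quoted here would come from tables or other software.

```python
# Durbin-Watson statistic on the regression residuals; values near 0 suggest
# positive autocorrelation, near 2 none, near 4 negative.
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(fit.resid))
```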
Carrying out a semi-parametric test for autocorrelation, consider the following
autocorrelation function plot (ACF plot) of the standardized residuals:

The ACF plot is slightly hanging at 1, indicating that differencing may be required.
Currently, there appears to be statistically significant positive autocorrelation at lags 1 and 2
(order-1 and order-2 autocorrelation).

The values of the order-1 and order-2 correlation, respectively, are 0.663 and 0.339. The
first value is very large, and the second is moderately large. The results from this test are
consistent with the results from the Durbin-Watson test in that both indicate a statistically
significant degree of positive autocorrelation.
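
A sketch of how such an ACF plot could be produced, where std_resid comes from the influence sketch above; plot_acf draws approximate 95% significance bounds automatically:

```python
# ACF plot of the standardized residuals with significance bounds.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(std_resid, lags=10)
plt.show()
```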
Carrying out a non-parametric test for autocorrelation, consider the following runs test:

The runs test is statistically significant at α = 0.05, with a standardized runs statistic of
-2.9729 and a corresponding p-value of 0.003. This is consistent with the results of the previous
two tests, and adds to the strength of the rejection of the null hypothesis above that there is no
statistically significant autocorrelation. What can be concluded is that there is indeed some
degree of autocorrelation which shows up as statistically significant for some tests, and that it
may be a good idea to difference the data.
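
A sketch of a runs test on the residuals follows. The runstest_1samp function lives in a statsmodels sandbox module, so its location and small-sample correction may vary by version, and its output need not match the report's software exactly.

```python
# Runs test on the signs of the residuals: too few runs suggests positive
# autocorrelation, too many suggests negative autocorrelation.
from statsmodels.sandbox.stats.runs import runstest_1samp

z, pval = runstest_1samp(fit.resid, cutoff=0)
print(z, pval)
```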

But before transforming the model to address autocorrelation, it is important to first deal
with multicollinearity and variable selection, which affect the quality of the regression, because
it would be wasteful to build transformations of redundant predictors that would themselves be
redundant. Below is a series of scatterplots of each predictor against each other predictor
(spor = entertainment sales, furn = furniture sales, dept = department store sales):

Correlations in any of these scatterplots indicate potential multicollinearity.


Entertainment sales seem to be highly correlated with furniture sales. A matrix of the correlations
(Pearson correlation coefficients) of each predictor variable with each other predictor variable as
well as the corresponding p-values:
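
The matrix output itself is not reproduced here; as a sketch, it could be computed as follows, with the same assumed column names as before:

```python
# Pairwise Pearson correlations among the predictors; scipy's pearsonr
# returns both the correlation and its p-value.
from itertools import combinations
from scipy.stats import pearsonr

for a, b in combinations(["entertainment", "furniture", "dept"], 2):
    r, p = pearsonr(sales[a], sales[b])
    print(a, b, round(r, 3), round(p, 4))
```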

At the α = 0.05 level of significance, all predictors appear to be statistically significantly
correlated with all other predictors, with p-values less than 0.001! This is a tricky situation
because we cannot remove all three predictors from the model, because then we would have no
model. This leads us to the issue of model selection: which predictors, if any, should be omitted
from the model?
To determine the best model for this set of data, best subsets analysis will be used. A table
of the results from a best subsets regression:

In theory, the best model should lie in the column in which R2 levels off, adjusted R2 is
maximized, and Mallows Cp is minimized. Here, all R2 values are similar. Adjusted R2 is
maximized at the uppermost of the two-predictor models, at 80.5. However, this is not a
decisive maximum because the runner-up is very close at 79.8. Mallows Cp is minimized, at 2.1,
also at the uppermost of the two-predictor models. Therefore, according to this best subsets
regression, it appears that a model of two predictors is best, those two being entertainment sales
and furniture sales. This model has an R2 of 81.8.
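
A sketch of an exhaustive best-subsets search over the three predictors, ranking subsets by adjusted R2 and Mallows Cp; the column names are the same assumptions as before, and the Cp formula uses the full model's error variance:

```python
# Best subsets by brute force: fit every subset and report adjusted R^2 and
# Mallows' Cp = SSE_p / s^2_full - (n - 2*(k + 1)), where k = subset size.
from itertools import combinations
import statsmodels.api as sm

cols = ["entertainment", "furniture", "dept"]
full = sm.OLS(y, sm.add_constant(sales[cols])).fit()
s2_full = full.mse_resid
n = len(y)

for k in range(1, len(cols) + 1):
    for subset in combinations(cols, k):
        m = sm.OLS(y, sm.add_constant(sales[list(subset)])).fit()
        cp = m.ssr / s2_full - (n - 2 * (k + 1))
        print(subset, round(m.rsquared_adj, 3), round(cp, 1))
```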

The least squares multiple regression line is:

Grocery Sales = 1.9373 x 10^4 + 2.025 * Entertainment Sales + 1.610 * Furniture Sales
The intercept coefficient is useless to interpret because it is nonsensical in the context of
the data: net entertainment sales and furniture sales are highly unlikely to equal 0 in any month
in any year.
Holding all other predictors constant:

An increase in monthly entertainment sales of $1 million is associated with an expected
estimated increase in monthly grocery sales of $2.025 million, or approximately
$2,025,000.
An increase in monthly furniture sales of $1 million is associated with an expected
estimated increase in monthly grocery sales of $1.610 million, or approximately
$1,610,000.

Not much has changed. The regression is of good strength with R2 = 0.818 (originally 0.819),
meaning that 81.8% of the variability in the target variable can be accounted for by the predicting
variables. The residual standard error is 561.9 (originally 571.6), representing the standard
deviation of the points formed around the regression line. This means that roughly 95% of actual
target values should fall within +/- 2 * residual standard error, or +/- 1123.8 (originally 1143.2),
of the predicted target values. The p-value of the t-statistic for each regression coefficient gives
the probability of observing a coefficient estimate at least that extreme if the true coefficient
were 0. These p-values correspond to the following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with grocery sales,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with grocery sales, holding all other predictors equal)
It appears that only for entertainment sales (p-value = 0.0338, originally 0.0406) can we
reject the null hypothesis. For furniture sales (p-value = 0.0590, originally 0.0810) we fail to
reject the null hypothesis. Both p-values have decreased. Again, the p-value for furniture sales,
0.0590, is very close to 0.05, so this is only a marginal failure to reject. This means that
entertainment sales has a statistically significant association with grocery sales, and furniture
sales may potentially have some association with grocery sales. The p-value of the F-statistic
remains less than 0.001, significant at α = 0.05, meaning that the model as a whole is
statistically significant. It corresponds to the following hypothesis test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis, meaning that the model as a whole has some predictive
capability. Consider the following standardized residual plots:

The normal probability plot implies that the data are fairly normally distributed, but also
implies that the data have a slightly short left tail and a slightly fat right tail. This deviation
from normality could mean that estimates of regression coefficients may be slightly
inappropriate, because a part of the signal may be mistakenly treated as noise. The residuals vs.
fitted values plot may indicate some potential non-linearity and/or heteroscedasticity, due to a
pattern similar to that in the original residuals vs. fitted values plot. Again, this may mean that
estimates of regression coefficients may be less accurate, and also that predictive accuracy may
be incorrect. The histogram is approximately normally distributed, perhaps a little more so than
the original, though it may still be slightly left-skewed. The residuals vs. the order of the data
still indicate some potential positive autocorrelation, because there are relatively few runs and
nearby standardized residuals appear correlated with one another. Consider the following
standardized residual plots vs. each predictor:

Both plots display some degree of potential non-linearity and heteroscedasticity, as did
the original, meaning that estimates of regression coefficients may be less accurate, and also that
predictive accuracy may be incorrect. There do not appear to be any really obvious extreme
values, but regardless, consider the following series of diagnostic plots to assess the magnitude
of any possible outliers or leverage points, followed by a table with specific values for diagnostic
tests for potential extreme values (containing observation numbers, standardized residual values,
hat values, and Cook's distances, from left to right):

The topmost plot shows the Cook's distances for each observation; none are above 1. No
observations have standardized residuals more extreme than +/- 2.5. Only one observation,
number 16, has a hat value above 2.5((p + 1)/n) = 2.5((2 + 1)/30) = 0.25, at h_i = 0.339. This
means that observation 16 may be a statistical leverage point due to its statistically significant
h_i. But all in all, this diagnostic checking is even less alarming than the original diagnostic
checking, so for the reasons mentioned previously, no action will be taken.
Returning to the topic of autocorrelation, there may be some in the data, as indicated by
the residuals vs. order of the data plot. We expect that since no action has been taken to address
autocorrelation, there is still autocorrelation in the data. Reproducing this plot:

Carrying out a parametric test for autocorrelation, consider the Durbin-Watson statistic:

The Durbin-Watson statistic is 0.5018, with a p-value of less than 0.001, which is still
significant at α = 0.05. We reject the null hypothesis and state that there may indeed be
statistically significant autocorrelation (positive, because 0.5018 < 2).
Carrying out a semi-parametric test for autocorrelation, consider the following
autocorrelation function plot (ACF plot) of the standardized residuals:

It is still hanging at 1, indicating that differencing may be required, and there still appears
to be statistically significant positive autocorrelation at lags 1 and 2 (order-1 and order-2
autocorrelation).

The values of the order-1 and order-2 correlations are now 0.646 and 0.378, compared
with the original 0.663 and 0.339. The lag-1 autocorrelation has decreased very slightly, while
the lag-2 autocorrelation has actually increased slightly; overall, the amount of autocorrelation
remains moderately large. The results from this test are consistent with the results from the
Durbin-Watson test.
Carrying out a non-parametric test for autocorrelation, consider the following runs test:

The runs test is still statistically significant at α = 0.05, with a standardized runs statistic
of -2.9729 and a corresponding p-value of 0.003. This is consistent with the results of the
previous two tests, and adds to the strength of the rejection of the null hypothesis above that
there is no statistically significant autocorrelation. What can be concluded is that there is indeed
some degree of autocorrelation which shows up as statistically significant for some tests, and that
it may be a good idea to difference the data.
Now having finished variable selection, we will deal with autocorrelation. To do this, we
consider four paths of action: detrending, deseasonalizing, lagging, and differencing.
The time series plots of each variable all appear to have some correlation with time,
meaning that detrending may be appropriate. To do this, we could add a new time trend variable,
such as simply time, with time = 1 corresponding to September 2010, time = 30 corresponding
to February 2013, and an increment of 1 unit corresponding to an increment of one month in
real-world time.
As particularly evident in the residuals vs. order of the data plot and the ACF plot,
seasonality does not appear to be an issue. This makes sense because the data are already
seasonally adjusted.
Lagging may be appropriate, because sales in one period may logically affect sales in the
next period, and also because the three autocorrelation diagnostics above all favor rejecting the
null hypothesis in favor of the alternative: the errors follow an AR(1) model with ρ ≠ 0. To
utilize lagging, instead of using the variables entertainment sales and furniture sales, we could
use the variables lag of entertainment sales and lag of furniture sales, as well as introduce a new
variable, the lag of grocery sales, to predict grocery sales.
As evident in the slightly hanging ACF plot and possible heteroscedasticity in the
residuals vs. fitted values plot earlier on, differencing may also be appropriate. It may also make
contextual sense to difference the data because the changes in sales between different months
may be useful information. To utilize differencing, instead of using grocery sales as the target
variable, we could use the difference of grocery sales.
So, taking all this into account, in an attempt to deal with autocorrelation, we will analyze
a model with the following changes:
Previously: Grocery Sales = β_0 + β_1 * Entertainment Sales + β_2 * Furniture Sales
Transformed: Grocery Sales = β_0 + β_1 * Lag Entertainment Sales + β_2 * Lag Furniture Sales + β_3 *
Lag Grocery Sales + β_4 * Time
The insertion of Lag Grocery Sales takes differencing into account, because differencing
Grocery Sales is the same thing as transforming Grocery Sales to (Grocery Sales - Lag Grocery
Sales) and then moving Lag Grocery Sales to the other side of the equation. Consider the
following regression:
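
A sketch of how the lagged predictors and this transformed model could be built with pandas follows; shift(1) creates the lags and costs the first observation, leaving n = 29. The frame and variable names are hypothetical.

```python
# Build one-month lags plus a time trend, then refit by OLS.
lagged = pd.DataFrame({
    "lag_entertainment": sales["entertainment"].shift(1),
    "lag_furniture": sales["furniture"].shift(1),
    "lag_grocery": sales["grocery"].shift(1),
    "time": range(1, len(sales) + 1),
}).dropna()                                   # drops the first (NaN) row

fit_lag = sm.OLS(sales["grocery"].iloc[1:], sm.add_constant(lagged)).fit()
print(fit_lag.summary())
```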

The least squares multiple regression line is:

Grocery Sales = 1.740 x 10^4 - 3.616 x 10^-1 * Lag Entertainment Sales - 7.209 x 10^-2 * Lag Furniture Sales + 6.778 x 10^-1 * Lag Grocery Sales + 5.312 x 10^1 * Time
The intercept coefficient is useless to interpret because it is nonsensical in the context of
the data: net lagged entertainment sales, furniture sales, and grocery sales are highly unlikely to
equal 0 in any month in any year.
Holding all other predictors constant:

An increase in monthly entertainment sales of $1 million during an arbitrary previous
month is associated with an expected estimated decrease in monthly grocery sales of the
following month of $3.616 x 10^-1 million, or approximately $361,600.
An increase in monthly furniture sales of $1 million during an arbitrary previous month is
associated with an expected estimated decrease in monthly grocery sales of the following
month of $7.209 x 10^-2 million, or approximately $72,090.
An increase in monthly grocery sales of $1 million during an arbitrary previous month is
associated with an expected estimated increase in monthly grocery sales of the following
month of $6.778 x 10^-1 million, or approximately $677,800.
An increase in time by an increment of one month is associated with an expected
estimated increase in monthly grocery sales of $5.312 x 10^1 million, or approximately
$53,120,000.

The regression is of great strength with R2 = 0.975, meaning that 97.5% of the variability
in the target variable can be accounted for by the predicting variables. The residual standard error
is 201.6, representing the standard deviation of the points formed around the regression line. This
means that roughly 95% of actual target values should fall within +/- 2 * residual standard error,
or +/- 403.2, of the predicted target values. This interval is almost three times narrower than
those of the previous regressions. The p-value of the t-statistic for each regression coefficient
gives the probability of observing a coefficient estimate at least that extreme if the true
coefficient were 0. These p-values correspond to the following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with grocery sales,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with grocery sales, holding all other predictors equal)
It appears that only for lag grocery sales (p-value less than 0.001) can we reject the null
hypothesis at α = 0.05. For lag entertainment sales (p-value = 0.3495), lag furniture sales
(p-value = 0.8829), and time (p-value = 0.1237), we fail to reject the null hypothesis. This means
that grocery sales in a preceding period is the only predictor that has a statistically significant
association with grocery sales in the following period. The p-value of the F-statistic is less than
0.001, significant at α = 0.05, meaning that the model as a whole is statistically significant. It
corresponds to the following hypothesis test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis, meaning that the model as a whole has some predictive
capability.

The variance inflation factors indicate problematic multicollinearity (VIF > 10)
among all predictors except Lag Grocery Sales. Multicollinearity will be analyzed in greater
detail below. Consider the following standardized residual plots:

The normal probability plot implies that the data are fairly normally distributed, but also
implies that the data have a slightly fat left tail and a slightly short right tail. This slight deviation
from normality could mean that estimates of regression coefficients may be slightly
inappropriate, because a part of the signal may be mistakenly treated as noise. The residuals
vs. fitted values plot is well behaved and indicates neither any noticeable heteroscedasticity nor
any noticeable non-linearity. The histogram, however, appears to be somewhat right-skewed. The
residuals vs. the order of the data do not appear to indicate any noticeable autocorrelation!
Consider the following standardized residual plots vs. each predictor:

All of these standardized residuals vs. predictors plots appear to be reasonably well
behaved. There may be one potential extreme value with a large standardized residual, so
consider the following series of diagnostic plots to assess the magnitude of any possible outliers
or leverage points.

No observations have Cook's distances over 1. Observation 8 appears to have a large
standardized residual that is above 2.5 (2.875). One observation has a hat value above
2.5((p + 1)/n) = 2.5((4 + 1)/29) = 0.431: observation 17 (h_i = 0.432, only extremely slightly
above the cutoff). Since these data involve lagged variables, what shows up as observation x
here is actually observation x + 1 of the original data.
A table with specific values for diagnostic tests for potential extreme values (containing
observation numbers, standardized residual values, hat values, and Cook's distances, from left to
right):

It appears that the main observations of interest regarding extreme values are observations
8 and 17 (tabled below). Observation 8 is a statistical outlier due to its statistically significant
standardized residual. Observation 17 is a statistical leverage point due to its statistically
significant h_i. Observation 8 corresponds to April 2011. Observation 17 corresponds to January
2012. There do not appear to be any good explanations for why these observations should be
deemed extreme, because it does not appear that any significant real-world contextual events
occurring in these two months could have had a major impact on grocery sales, entertainment
sales, furniture sales, department store sales, or time. In conclusion, since observations 8 and 17
are both difficult to explain as logical extreme values, each has only one of the three extreme
value diagnostics flagged as statistically significant, and the flagged values are only a small
degree beyond the required thresholds (2.875 is not that much > 2.5 and 0.432 barely > 0.431),
no action will be taken.
Observation (Also Time)   Date      Lag Entertain. Sales   Lag Furniture Sales   Lag Grocery Sales   Time
8                         2011-04   7,046                  7,471                 45,372              8
17                        2012-01   6,942                  7,684                 46,334              17

Returning to the topic of autocorrelation, even though we have attempted to remove it, there
may still be some left. We cannot carry out a Durbin-Watson test because we applied lagging to
the data, making the Durbin-Watson statistic meaningless. Consider the following semi-parametric
test for autocorrelation, in the form of an ACF plot of the standardized residuals:

The ACF plot indicates no autocorrelation whatsoever at any lags! The problem of
autocorrelation appears to have been appropriately dealt with!

Carrying out a non-parametric test for autocorrelation, consider the following runs test:

The runs test is not statistically significant at α = 0.05, with a standardized runs statistic
of 0.9532 (relatively close to 0, consistent with no autocorrelation) and a corresponding p-value
of 0.3405. This is consistent with the results of the previous test and supports the conclusion
that the autocorrelation has been dealt with, and that no further transformations to address it
are needed.
But there may be problematic multicollinearity in the data, as indicated by the high VIFs.
This could affect the quality of the regression. This leads us again to the topic of model
selection. Below is a series of scatterplots of each predictor against each other predictor (LagSpo
= Lag Entertainment Sales, LagFur = Lag Furniture Sales, LagGro = Lag Grocery Sales, Tim =
Time):

Correlations in any of these scatterplots indicate potential multicollinearity.


Unfortunately, again, all predictors appear to be highly correlated with all other predictors. A
matrix of the correlations (Pearson correlation coefficients) of each predictor variable with each
other predictor variable as well as the corresponding p-values:

At the α = 0.05 level of significance, all predictors appear to be statistically significantly
correlated with all other predictors, with p-values less than 0.001 (again!). Again, this is a tricky
situation because we cannot remove all predictors from the model, because then we would have
no model. This leads us to the issue of model selection again: which predictors, if any, should
be omitted from the model?
To determine the best model for this set of data, best subsets analysis will be used. A table
of the results from a best subsets regression:

Again, in theory, the best model should lie in the column in which R2 levels off, adjusted
R2 is maximized, and Mallows Cp is minimized. Here, all R2 values are similar. Adjusted R2 is
maximized at two points, the uppermost of the two-predictor models and the bottommost of the
three-predictor models, at 97.2. This is clearly not a decisive maximum, because not only is it
attained in two places, but the runners-up are extremely close at 97.1. Mallows Cp is minimized,
at 2.1 (again), also at the uppermost of the two-predictor models. Therefore, according to this
best subsets regression, it appears that a model of two predictors is best, those two being Lag
Grocery Sales and Time. This model has an R2 of 97.4.

The least squares multiple regression line is:

Grocery Sales = 1.306 x 10^4 + 7.098 x 10^-1 * Lag Grocery Sales + 3.402 x 10^1 * Time
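
As a final sketch, this reduced model could be refit from the hypothetical `lagged` frame built earlier; if the assumed data line up with the report's, the coefficients should roughly match those above.

```python
# Final two-predictor model: lag grocery sales and a time trend.
X_final = sm.add_constant(lagged[["lag_grocery", "time"]])
fit_final = sm.OLS(sales["grocery"].iloc[1:], X_final).fit()
print(fit_final.params)   # intercept, lag coefficient, time coefficient
```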
The intercept coefficient is useless to interpret because grocery sales are highly unlikely
to equal 0 in any month of any year. Holding all other predictors constant:

An increase in monthly grocery sales of $1 million during an arbitrary previous month is
associated with an expected estimated increase in monthly grocery sales of the following
month of $7.098 x 10^-1 million, or approximately $709,800.
An increase in time by an increment of one month is associated with an expected
estimated increase in monthly grocery sales of $3.402 x 10^1 million, or approximately
$34,020,000.

Not much has changed. The regression is still of great strength with R2 = 0.9739 (originally
0.975), meaning that 97.39% of the variability in the target variable can be accounted for by the
predicting variables. The residual standard error is 197.9 (originally 201.6), representing the
standard deviation of the points formed around the regression line. This means that roughly 95%
of actual target values should fall within +/- 2 * residual standard error, or +/- 395.8 (originally
403.2), of the predicted target values. The p-value of the t-statistic for each regression coefficient
gives the probability of observing a coefficient estimate at least that extreme if the true
coefficient were 0. These p-values correspond to the following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with grocery sales,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with grocery sales, holding all other predictors equal)
It appears that both predictors are statistically significant at α = 0.05, with p-values of less
than 0.001 and 0.0423, respectively, so for both predictors we reject the null hypothesis. This
means that Lag Grocery Sales and Time each have a statistically significant association with
grocery sales. The p-value of the F-statistic remains less than 0.001, significant at α = 0.05,
meaning that the model as a whole is statistically significant. It corresponds to the following
hypothesis test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis, meaning that the model as a whole has some predictive
capability. Consider the following standardized residual plots:

The normal probability plot displays a bit of potential non-normality, and implies that the
data have slightly fat left and right tails. The residuals vs. fitted values plot is reasonably well
behaved. The histogram does not look very normally distributed. The residuals vs. the order of
the data indicate that there is not any noticeable autocorrelation. Overall, the deviations from
normality mean that estimates of regression coefficients may be inappropriate, because a part of
the signal may be mistakenly treated as noise. Consider the following standardized residual
plots vs. each predictor:

Both plots look well behaved. There do not appear to be any really obvious extreme
values, but regardless, consider the following series of diagnostic plots to assess the magnitude
of any possible outliers or leverage points, followed by a table with specific values for diagnostic
tests for potential extreme values (containing observation numbers, standardized residual values,
hat values, and Cook's distances, from left to right):

No observations have Cook's distances above 1. No observations have standardized
residuals more extreme than +/- 2.5. Observations 2 and 3 have h_i above 2.5((p + 1)/n) = 2.5((2
+ 1)/30) = 0.25, at h_i = 0.297 and 0.251, respectively. This means that observations 2 and 3 may
be statistical leverage points due to their statistically significant h_i values. But all in all, this
diagnostic checking is again not very dramatic, because again, no significant events that could
have noticeably impacted sales (or time, obviously) occurred in the months of these observations,
and the diagnostic values of the observations are not extraordinarily extreme. So, no action
will be taken.
Also, multicollinearity will not be tested for, because it has already been determined that
time is strongly correlated with lag grocery sales, which is expected because time is strongly
correlated with grocery sales. Furthermore, an additional best subsets regression will not be
needed, because this model is the result of a best subsets regression and no modifications have
been made to it since.
In conclusion, in the U.S., monthly sales from grocery stores appear to have statistically
significant associations with monthly sales from grocery stores in the month before, and with
time. Months with higher grocery store sales tend to be followed by a month with similarly high
grocery store sales, and months with lower grocery store sales tend to be followed by a month
with similarly low grocery store sales. As time goes on, grocery store sales tend to rise. This is
good news because it could mean that the standard of living is increasing, with people generally
feeling comfortable purchasing more and more groceries as time progresses. It could also mean
that the United States' population is growing, because more people need more groceries to
survive. Also, the fact that months with higher grocery store sales tend to be followed by a
month with even higher grocery store sales may imply that the factors impacting grocery store
sales are heavy and slow, so to speak, in that they are not known for creating sudden change but
rather gradual change that persists across periods of time.
The final model, which models grocery sales as a function of lag grocery sales, time, and a
constant, is reasonably strong. However, the slight deviations from normality indicated in the
residual plots may mean that estimates of regression coefficients are somewhat inappropriate,
because a part of the signal may be mistakenly treated as noise. Also, the strong multicollinearity
in the model may mean that estimates of any one predictor's impact on the target variable are
off. But this multicollinearity is not surprising, because it makes sense that time is strongly
correlated with lag grocery sales, given that time is strongly correlated with grocery sales.

The data:

http://www.census.gov/retail/marts/www/timeseries.html
http://www.census.gov/retail/marts/www/download/text/adv44x72.txt
http://www.census.gov/retail/marts/www/download/text/adv45100.txt
http://www.census.gov/retail/marts/www/download/text/adv44200.txt
http://www.census.gov/retail/marts/www/download/text/adv45210.txt
*: Directly quoted from the United States Census Bureau website containing the data:
"Sales data are adjusted for seasonal, holiday, and trading-day differences, but not for
price changes. See the Adjustment Factors for Seasonal and Other Variations of Monthly
Estimates for more information."
